All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH 0/4 v2] mm: give __GFP_REPEAT a better semantic
@ 2017-03-07 15:48 ` Michal Hocko
  0 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2017-03-07 15:48 UTC (permalink / raw)
  To: linux-mm
  Cc: Vlastimil Babka, Johannes Weiner, Mel Gorman, Andrew Morton,
	LKML, Darrick J. Wong, Heiko Carstens, Michal Hocko

Hi,
this is a follow up for __GFP_REPEAT clean up merged in 4.7. The previous
version of this patch series was posted as an RFC
http://lkml.kernel.org/r/1465212736-14637-1-git-send-email-mhocko@kernel.org
Since then I have reconsidered the semantic and made it a counterpart
to the __GFP_NORETRY and made it the other extreme end of the retry
logic. Both are not invoking the OOM killer so they are suitable
for allocation paths with a fallback. Also a new potential user has
emerged (kvmalloc - see patch 4). I have also renamed the flag from
__GFP_RETRY_HARD to __GFP_RETRY_MAY_FAIL as this should be more clear.

I have kept the RFC status because of the semantic change. The patch 1
is an exception because it should be merge regardless of the rest.

The main motivation for the change is that the current implementation of
__GFP_REPEAT is not very much useful.

The documentation says:
 * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
 *   _might_ fail.  This depends upon the particular VM implementation.

It just fails to mention that this is true only for large (costly) high
order which has been the case since the flag was introduced. A similar
semantic would be really helpful for smal orders as well, though,
because we have places where a failure with a specific fallback error
handling is preferred to a potential endless loop inside the page
allocator.

The earlier cleanup dropped __GFP_REPEAT usage for low (!costly) order
users so only those which might use larger orders have stayed. One user
which slipped through cracks is addressed in patch 1.

Let's rename the flag to something more verbose and use it for existing
users. Semantic for those will not change. Then implement low (!costly)
orders failure path which is hit after the page allocator is about to
invoke the oom killer. Now we have a good counterpart for __GFP_NORETRY
and finally can tell try as hard as possible without the OOM killer.

Xfs code already has an existing annotation for allocations which are
allowed to fail and we can trivially map them to the new gfp flag
because it will provide the semantic KM_MAYFAIL wants.

kvmalloc will allow also !costly high order allocations to retry hard
before falling back to the vmalloc.

The patchset is based on the current linux-next.

Shortlog
Michal Hocko (4):
      s390: get rid of superfluous __GFP_REPEAT
      mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic
      xfs: map KM_MAYFAIL to __GFP_RETRY_MAYFAIL
      mm: kvmalloc support __GFP_RETRY_MAYFAIL for all sizes

Diffstat
 Documentation/DMA-ISA-LPC.txt                |  2 +-
 arch/powerpc/include/asm/book3s/64/pgalloc.h |  2 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c          |  2 +-
 arch/s390/mm/pgalloc.c                       |  2 +-
 drivers/mmc/host/wbsd.c                      |  2 +-
 drivers/s390/char/vmcp.c                     |  2 +-
 drivers/target/target_core_transport.c       |  2 +-
 drivers/vhost/net.c                          |  2 +-
 drivers/vhost/scsi.c                         |  2 +-
 drivers/vhost/vsock.c                        |  2 +-
 fs/btrfs/check-integrity.c                   |  2 +-
 fs/btrfs/raid56.c                            |  2 +-
 fs/xfs/kmem.h                                | 10 +++++++++
 include/linux/gfp.h                          | 32 +++++++++++++++++++---------
 include/linux/slab.h                         |  3 ++-
 include/trace/events/mmflags.h               |  2 +-
 mm/hugetlb.c                                 |  4 ++--
 mm/internal.h                                |  2 +-
 mm/page_alloc.c                              | 14 +++++++++---
 mm/sparse-vmemmap.c                          |  4 ++--
 mm/util.c                                    | 14 ++++--------
 mm/vmalloc.c                                 |  2 +-
 mm/vmscan.c                                  |  8 +++----
 net/core/dev.c                               |  6 +++---
 net/core/skbuff.c                            |  2 +-
 net/sched/sch_fq.c                           |  2 +-
 tools/perf/builtin-kmem.c                    |  2 +-
 27 files changed, 78 insertions(+), 53 deletions(-)

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [RFC PATCH 0/4 v2] mm: give __GFP_REPEAT a better semantic
@ 2017-03-07 15:48 ` Michal Hocko
  0 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2017-03-07 15:48 UTC (permalink / raw)
  To: linux-mm
  Cc: Vlastimil Babka, Johannes Weiner, Mel Gorman, Andrew Morton,
	LKML, Darrick J. Wong, Heiko Carstens, Michal Hocko

Hi,
this is a follow up for __GFP_REPEAT clean up merged in 4.7. The previous
version of this patch series was posted as an RFC
http://lkml.kernel.org/r/1465212736-14637-1-git-send-email-mhocko@kernel.org
Since then I have reconsidered the semantic and made it a counterpart
to the __GFP_NORETRY and made it the other extreme end of the retry
logic. Both are not invoking the OOM killer so they are suitable
for allocation paths with a fallback. Also a new potential user has
emerged (kvmalloc - see patch 4). I have also renamed the flag from
__GFP_RETRY_HARD to __GFP_RETRY_MAY_FAIL as this should be more clear.

I have kept the RFC status because of the semantic change. The patch 1
is an exception because it should be merge regardless of the rest.

The main motivation for the change is that the current implementation of
__GFP_REPEAT is not very much useful.

The documentation says:
 * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
 *   _might_ fail.  This depends upon the particular VM implementation.

It just fails to mention that this is true only for large (costly) high
order which has been the case since the flag was introduced. A similar
semantic would be really helpful for smal orders as well, though,
because we have places where a failure with a specific fallback error
handling is preferred to a potential endless loop inside the page
allocator.

The earlier cleanup dropped __GFP_REPEAT usage for low (!costly) order
users so only those which might use larger orders have stayed. One user
which slipped through cracks is addressed in patch 1.

Let's rename the flag to something more verbose and use it for existing
users. Semantic for those will not change. Then implement low (!costly)
orders failure path which is hit after the page allocator is about to
invoke the oom killer. Now we have a good counterpart for __GFP_NORETRY
and finally can tell try as hard as possible without the OOM killer.

Xfs code already has an existing annotation for allocations which are
allowed to fail and we can trivially map them to the new gfp flag
because it will provide the semantic KM_MAYFAIL wants.

kvmalloc will allow also !costly high order allocations to retry hard
before falling back to the vmalloc.

The patchset is based on the current linux-next.

Shortlog
Michal Hocko (4):
      s390: get rid of superfluous __GFP_REPEAT
      mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic
      xfs: map KM_MAYFAIL to __GFP_RETRY_MAYFAIL
      mm: kvmalloc support __GFP_RETRY_MAYFAIL for all sizes

Diffstat
 Documentation/DMA-ISA-LPC.txt                |  2 +-
 arch/powerpc/include/asm/book3s/64/pgalloc.h |  2 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c          |  2 +-
 arch/s390/mm/pgalloc.c                       |  2 +-
 drivers/mmc/host/wbsd.c                      |  2 +-
 drivers/s390/char/vmcp.c                     |  2 +-
 drivers/target/target_core_transport.c       |  2 +-
 drivers/vhost/net.c                          |  2 +-
 drivers/vhost/scsi.c                         |  2 +-
 drivers/vhost/vsock.c                        |  2 +-
 fs/btrfs/check-integrity.c                   |  2 +-
 fs/btrfs/raid56.c                            |  2 +-
 fs/xfs/kmem.h                                | 10 +++++++++
 include/linux/gfp.h                          | 32 +++++++++++++++++++---------
 include/linux/slab.h                         |  3 ++-
 include/trace/events/mmflags.h               |  2 +-
 mm/hugetlb.c                                 |  4 ++--
 mm/internal.h                                |  2 +-
 mm/page_alloc.c                              | 14 +++++++++---
 mm/sparse-vmemmap.c                          |  4 ++--
 mm/util.c                                    | 14 ++++--------
 mm/vmalloc.c                                 |  2 +-
 mm/vmscan.c                                  |  8 +++----
 net/core/dev.c                               |  6 +++---
 net/core/skbuff.c                            |  2 +-
 net/sched/sch_fq.c                           |  2 +-
 tools/perf/builtin-kmem.c                    |  2 +-
 27 files changed, 78 insertions(+), 53 deletions(-)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 1/4] s390: get rid of superfluous __GFP_REPEAT
  2017-03-07 15:48 ` Michal Hocko
@ 2017-03-07 15:48   ` Michal Hocko
  -1 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2017-03-07 15:48 UTC (permalink / raw)
  To: linux-mm
  Cc: Vlastimil Babka, Johannes Weiner, Mel Gorman, Andrew Morton,
	LKML, Michal Hocko, Heiko Carstens

From: Michal Hocko <mhocko@suse.com>

__GFP_REPEAT has a rather weak semantic but since it has been introduced
around 2.6.12 it has been ignored for low order allocations.

page_table_alloc then uses the flag for a single page allocation. This
means that this flag has never been actually useful here because it has
always been used only for PAGE_ALLOC_COSTLY requests.

An earlier attempt to remove the flag 10d58bf297e2 ("s390: get rid of
superfluous __GFP_REPEAT") has missed this one but the situation is very
same here.

Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 arch/s390/mm/pgalloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/s390/mm/pgalloc.c b/arch/s390/mm/pgalloc.c
index 995f78532cc2..2776bad61094 100644
--- a/arch/s390/mm/pgalloc.c
+++ b/arch/s390/mm/pgalloc.c
@@ -144,7 +144,7 @@ struct page *page_table_alloc_pgste(struct mm_struct *mm)
 	struct page *page;
 	unsigned long *table;
 
-	page = alloc_page(GFP_KERNEL|__GFP_REPEAT);
+	page = alloc_page(GFP_KERNEL);
 	if (page) {
 		table = (unsigned long *) page_to_phys(page);
 		clear_table(table, _PAGE_INVALID, PAGE_SIZE/2);
-- 
2.11.0

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 1/4] s390: get rid of superfluous __GFP_REPEAT
@ 2017-03-07 15:48   ` Michal Hocko
  0 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2017-03-07 15:48 UTC (permalink / raw)
  To: linux-mm
  Cc: Vlastimil Babka, Johannes Weiner, Mel Gorman, Andrew Morton,
	LKML, Michal Hocko, Heiko Carstens

From: Michal Hocko <mhocko@suse.com>

__GFP_REPEAT has a rather weak semantic but since it has been introduced
around 2.6.12 it has been ignored for low order allocations.

page_table_alloc then uses the flag for a single page allocation. This
means that this flag has never been actually useful here because it has
always been used only for PAGE_ALLOC_COSTLY requests.

An earlier attempt to remove the flag 10d58bf297e2 ("s390: get rid of
superfluous __GFP_REPEAT") has missed this one but the situation is very
same here.

Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 arch/s390/mm/pgalloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/s390/mm/pgalloc.c b/arch/s390/mm/pgalloc.c
index 995f78532cc2..2776bad61094 100644
--- a/arch/s390/mm/pgalloc.c
+++ b/arch/s390/mm/pgalloc.c
@@ -144,7 +144,7 @@ struct page *page_table_alloc_pgste(struct mm_struct *mm)
 	struct page *page;
 	unsigned long *table;
 
-	page = alloc_page(GFP_KERNEL|__GFP_REPEAT);
+	page = alloc_page(GFP_KERNEL);
 	if (page) {
 		table = (unsigned long *) page_to_phys(page);
 		clear_table(table, _PAGE_INVALID, PAGE_SIZE/2);
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [RFC PATCH 2/4] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic
  2017-03-07 15:48 ` Michal Hocko
@ 2017-03-07 15:48   ` Michal Hocko
  -1 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2017-03-07 15:48 UTC (permalink / raw)
  To: linux-mm
  Cc: Vlastimil Babka, Johannes Weiner, Mel Gorman, Andrew Morton,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

__GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to
the page allocator. This has been true but only for allocations requests
larger than PAGE_ALLOC_COSTLY_ORDER. It has been always ignored for
smaller sizes. This is a bit unfortunate because there is no way to
express the same semantic for those requests and they are considered too
important to fail so they might end up looping in the page allocator for
ever, similarly to GFP_NOFAIL requests.

Now that the whole tree has been cleaned up and accidental or misled
usage of __GFP_REPEAT flag has been removed for !costly requests we can
give the original flag a better name and more importantly a more useful
semantic. Let's rename it to __GFP_RETRY_MAYFAIL which tells the user that
the allocator would try really hard but there is no promise of a
success. This will work independent of the order and overrides the
default allocator behavior. Page allocator users have several levels of
guarantee vs. cost options (take GFP_KERNEL as an example)
- GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
  attempt to free memory at all. The most light weight mode which even
  doesn't kick the background reclaim. Should be used carefully because
  it might deplete the memory and the next user might hit the more
  aggressive reclaim
- GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT)- optimistic
  allocation without any attempt to free memory from the current context
  but can wake kswapd to reclaim memory if the zone is below the low
  watermark. Can be used from either atomic contexts or when the request
  is a performance optimization and there is another fallback for a slow
  path.
- (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) - non
  sleeping allocation with an expensive fallback so it can access some
  portion of memory reserves. Usually used from interrupt/bh context with
  an expensive slow path fallback.
- GFP_KERNEL - both background and direct reclaim are allowed and the
  _default_ page allocator behavior is used. That means that !costly
  allocation requests are basically nofail (unless the requesting task
  is killed by the OOM killer) and costly will fail early rather than
  cause disruptive reclaim.
- GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior and
  all allocation requests fail early rather than cause disruptive
  reclaim (one round of reclaim in this implementation). The OOM killer
  is not invoked.
- GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator behavior
  and all allocation requests try really hard. The request will fail if the
  reclaim cannot make any progress. The OOM killer won't be triggered.
- GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
  and all allocation requests will loop endlessly until they
  succeed. This might be really dangerous especially for larger orders.

Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL because
they already had their semantic. No new users are added.
__alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if
there is no progress and we have already passed the OOM point. This
means that all the reclaim opportunities have been exhausted except the
most disruptive one (the OOM killer) and a user defined fallback
behavior is more sensible than keep retrying in the page allocator.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 Documentation/DMA-ISA-LPC.txt                |  2 +-
 arch/powerpc/include/asm/book3s/64/pgalloc.h |  2 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c          |  2 +-
 drivers/mmc/host/wbsd.c                      |  2 +-
 drivers/s390/char/vmcp.c                     |  2 +-
 drivers/target/target_core_transport.c       |  2 +-
 drivers/vhost/net.c                          |  2 +-
 drivers/vhost/scsi.c                         |  2 +-
 drivers/vhost/vsock.c                        |  2 +-
 fs/btrfs/check-integrity.c                   |  2 +-
 fs/btrfs/raid56.c                            |  2 +-
 include/linux/gfp.h                          | 32 +++++++++++++++++++---------
 include/linux/slab.h                         |  3 ++-
 include/trace/events/mmflags.h               |  2 +-
 mm/hugetlb.c                                 |  4 ++--
 mm/internal.h                                |  2 +-
 mm/page_alloc.c                              | 14 +++++++++---
 mm/sparse-vmemmap.c                          |  4 ++--
 mm/util.c                                    |  6 +++---
 mm/vmalloc.c                                 |  2 +-
 mm/vmscan.c                                  |  8 +++----
 net/core/dev.c                               |  6 +++---
 net/core/skbuff.c                            |  2 +-
 net/sched/sch_fq.c                           |  2 +-
 tools/perf/builtin-kmem.c                    |  2 +-
 25 files changed, 66 insertions(+), 45 deletions(-)

diff --git a/Documentation/DMA-ISA-LPC.txt b/Documentation/DMA-ISA-LPC.txt
index c41331398752..7a065ac4a9d1 100644
--- a/Documentation/DMA-ISA-LPC.txt
+++ b/Documentation/DMA-ISA-LPC.txt
@@ -42,7 +42,7 @@ requirements you pass the flag GFP_DMA to kmalloc.
 
 Unfortunately the memory available for ISA DMA is scarce so unless you
 allocate the memory during boot-up it's a good idea to also pass
-__GFP_REPEAT and __GFP_NOWARN to make the allocator try a bit harder.
+__GFP_RETRY_MAYFAIL and __GFP_NOWARN to make the allocator try a bit harder.
 
 (This scarcity also means that you should allocate the buffer as
 early as possible and not release it until the driver is unloaded.)
diff --git a/arch/powerpc/include/asm/book3s/64/pgalloc.h b/arch/powerpc/include/asm/book3s/64/pgalloc.h
index cd5e7aa8cc34..1b835aa5b4d1 100644
--- a/arch/powerpc/include/asm/book3s/64/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/64/pgalloc.h
@@ -56,7 +56,7 @@ static inline pgd_t *radix__pgd_alloc(struct mm_struct *mm)
 	return (pgd_t *)__get_free_page(PGALLOC_GFP);
 #else
 	struct page *page;
-	page = alloc_pages(PGALLOC_GFP | __GFP_REPEAT, 4);
+	page = alloc_pages(PGALLOC_GFP | __GFP_RETRY_MAYFAIL, 4);
 	if (!page)
 		return NULL;
 	return (pgd_t *) page_address(page);
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 8c68145ba1bd..8ad2c309f14a 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -93,7 +93,7 @@ int kvmppc_allocate_hpt(struct kvm_hpt_info *info, u32 order)
 	}
 
 	if (!hpt)
-		hpt = __get_free_pages(GFP_KERNEL|__GFP_ZERO|__GFP_REPEAT
+		hpt = __get_free_pages(GFP_KERNEL|__GFP_ZERO|__GFP_RETRY_MAYFAIL
 				       |__GFP_NOWARN, order - PAGE_SHIFT);
 
 	if (!hpt)
diff --git a/drivers/mmc/host/wbsd.c b/drivers/mmc/host/wbsd.c
index bd04e8bae010..b58fa5b5b972 100644
--- a/drivers/mmc/host/wbsd.c
+++ b/drivers/mmc/host/wbsd.c
@@ -1386,7 +1386,7 @@ static void wbsd_request_dma(struct wbsd_host *host, int dma)
 	 * order for ISA to be able to DMA to it.
 	 */
 	host->dma_buffer = kmalloc(WBSD_DMA_SIZE,
-		GFP_NOIO | GFP_DMA | __GFP_REPEAT | __GFP_NOWARN);
+		GFP_NOIO | GFP_DMA | __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
 	if (!host->dma_buffer)
 		goto free;
 
diff --git a/drivers/s390/char/vmcp.c b/drivers/s390/char/vmcp.c
index 65f5a794f26d..98749fa817da 100644
--- a/drivers/s390/char/vmcp.c
+++ b/drivers/s390/char/vmcp.c
@@ -98,7 +98,7 @@ vmcp_write(struct file *file, const char __user *buff, size_t count,
 	}
 	if (!session->response)
 		session->response = (char *)__get_free_pages(GFP_KERNEL
-						| __GFP_REPEAT | GFP_DMA,
+						| __GFP_RETRY_MAYFAIL | GFP_DMA,
 						get_order(session->bufsize));
 	if (!session->response) {
 		mutex_unlock(&session->mutex);
diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c
index 434d9d693989..e585d301c665 100644
--- a/drivers/target/target_core_transport.c
+++ b/drivers/target/target_core_transport.c
@@ -251,7 +251,7 @@ int transport_alloc_session_tags(struct se_session *se_sess,
 	int rc;
 
 	se_sess->sess_cmd_map = kzalloc(tag_num * tag_size,
-					GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+					GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_MAYFAIL);
 	if (!se_sess->sess_cmd_map) {
 		se_sess->sess_cmd_map = vzalloc(tag_num * tag_size);
 		if (!se_sess->sess_cmd_map) {
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index f61f852d6cfd..7d2c4ce6d8d1 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -817,7 +817,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 	struct vhost_virtqueue **vqs;
 	int i;
 
-	n = kvmalloc(sizeof *n, GFP_KERNEL | __GFP_REPEAT);
+	n = kvmalloc(sizeof *n, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
 	if (!n)
 		return -ENOMEM;
 	vqs = kmalloc(VHOST_NET_VQ_MAX * sizeof(*vqs), GFP_KERNEL);
diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index fd6c8b66f06f..ff02a942c4d5 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1404,7 +1404,7 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
 	struct vhost_virtqueue **vqs;
 	int r = -ENOMEM, i;
 
-	vs = kzalloc(sizeof(*vs), GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+	vs = kzalloc(sizeof(*vs), GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_MAYFAIL);
 	if (!vs) {
 		vs = vzalloc(sizeof(*vs));
 		if (!vs)
diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index d403c647ba56..5b76242d73e3 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -460,7 +460,7 @@ static int vhost_vsock_dev_open(struct inode *inode, struct file *file)
 	/* This struct is large and allocation could fail, fall back to vmalloc
 	 * if there is no other way.
 	 */
-	vsock = kvmalloc(sizeof(*vsock), GFP_KERNEL | __GFP_REPEAT);
+	vsock = kvmalloc(sizeof(*vsock), GFP_KERNEL | __GFP_RETRY_MAYFAIL);
 	if (!vsock)
 		return -ENOMEM;
 
diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index ab14c2e635ca..e334ed2b7e64 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -2923,7 +2923,7 @@ int btrfsic_mount(struct btrfs_fs_info *fs_info,
 		       fs_info->sectorsize, PAGE_SIZE);
 		return -1;
 	}
-	state = kzalloc(sizeof(*state), GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+	state = kzalloc(sizeof(*state), GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_MAYFAIL);
 	if (!state) {
 		state = vzalloc(sizeof(*state));
 		if (!state) {
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 1571bf26dc07..94af3db1d0e4 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -218,7 +218,7 @@ int btrfs_alloc_stripe_hash_table(struct btrfs_fs_info *info)
 	 * of a failing mount.
 	 */
 	table_size = sizeof(*table) + sizeof(*h) * num_entries;
-	table = kzalloc(table_size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+	table = kzalloc(table_size, GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_MAYFAIL);
 	if (!table) {
 		table = vzalloc(table_size);
 		if (!table)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 2bfcfd33e476..60af7937c6f2 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -25,7 +25,7 @@ struct vm_area_struct;
 #define ___GFP_FS		0x80u
 #define ___GFP_COLD		0x100u
 #define ___GFP_NOWARN		0x200u
-#define ___GFP_REPEAT		0x400u
+#define ___GFP_RETRY_MAYFAIL		0x400u
 #define ___GFP_NOFAIL		0x800u
 #define ___GFP_NORETRY		0x1000u
 #define ___GFP_MEMALLOC		0x2000u
@@ -136,26 +136,38 @@ struct vm_area_struct;
  *
  * __GFP_RECLAIM is shorthand to allow/forbid both direct and kswapd reclaim.
  *
- * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
- *   _might_ fail.  This depends upon the particular VM implementation.
+ * The default allocator behavior depends on the request size. We have a concept
+ * of so called costly allocations (with order > PAGE_ALLOC_COSTLY_ORDER).
+ * !costly allocations are too essential to fail so they are implicitly
+ * non-failing (with some exceptions like OOM victims might fail) by default while
+ * costly requests try to be not disruptive and back off even without invoking
+ * the OOM killer. The following three modifiers might be used to override some of
+ * these implicit rules
+ *
+ * __GFP_NORETRY: The VM implementation must not retry indefinitely and will
+ *   return NULL when direct reclaim and memory compaction have failed to allow
+ *   the allocation to succeed.  The OOM killer is not called with the current
+ *   implementation. This is a default mode for costly allocations.
+ *
+ * __GFP_RETRY_MAYFAIL: Try hard to allocate the memory, but the allocation attempt
+ *   _might_ fail. All viable forms of memory reclaim are tried before the fail.
+ *   The OOM killer is excluded because this would be too disruptive. This can be
+ *   used to override non-failing default behavior for !costly requests as well as
+ *   fortify costly requests.
  *
  * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
  *   cannot handle allocation failures. New users should be evaluated carefully
  *   (and the flag should be used only when there is no reasonable failure
  *   policy) but it is definitely preferable to use the flag rather than
- *   opencode endless loop around allocator.
- *
- * __GFP_NORETRY: The VM implementation must not retry indefinitely and will
- *   return NULL when direct reclaim and memory compaction have failed to allow
- *   the allocation to succeed.  The OOM killer is not called with the current
- *   implementation.
+ *   opencode endless loop around allocator. Using this flag for costly allocations
+ *   is _highly_ discouraged.
  */
 #define __GFP_IO	((__force gfp_t)___GFP_IO)
 #define __GFP_FS	((__force gfp_t)___GFP_FS)
 #define __GFP_DIRECT_RECLAIM	((__force gfp_t)___GFP_DIRECT_RECLAIM) /* Caller can reclaim */
 #define __GFP_KSWAPD_RECLAIM	((__force gfp_t)___GFP_KSWAPD_RECLAIM) /* kswapd can wake */
 #define __GFP_RECLAIM ((__force gfp_t)(___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM))
-#define __GFP_REPEAT	((__force gfp_t)___GFP_REPEAT)
+#define __GFP_RETRY_MAYFAIL	((__force gfp_t)___GFP_RETRY_MAYFAIL)
 #define __GFP_NOFAIL	((__force gfp_t)___GFP_NOFAIL)
 #define __GFP_NORETRY	((__force gfp_t)___GFP_NORETRY)
 
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 3c37a8c51921..064901ac3e37 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -469,7 +469,8 @@ static __always_inline void *kmalloc_large(size_t size, gfp_t flags)
  *
  * %__GFP_NOWARN - If allocation fails, don't issue any warnings.
  *
- * %__GFP_REPEAT - If allocation fails initially, try once more before failing.
+ * %__GFP_RETRY_MAYFAIL - Try really hard to succeed the allocation but fail
+ *   eventually.
  *
  * There are other flags available as well, but these are not intended
  * for general use, and so are not documented here. For a full list of
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 304ff94363b2..418142a4efce 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -34,7 +34,7 @@
 	{(unsigned long)__GFP_FS,		"__GFP_FS"},		\
 	{(unsigned long)__GFP_COLD,		"__GFP_COLD"},		\
 	{(unsigned long)__GFP_NOWARN,		"__GFP_NOWARN"},	\
-	{(unsigned long)__GFP_REPEAT,		"__GFP_REPEAT"},	\
+	{(unsigned long)__GFP_RETRY_MAYFAIL,	"__GFP_RETRY_MAYFAIL"},	\
 	{(unsigned long)__GFP_NOFAIL,		"__GFP_NOFAIL"},	\
 	{(unsigned long)__GFP_NORETRY,		"__GFP_NORETRY"},	\
 	{(unsigned long)__GFP_COMP,		"__GFP_COMP"},		\
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a7aa811b7d14..dc598bfe4ce9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1369,7 +1369,7 @@ static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
 
 	page = __alloc_pages_node(nid,
 		htlb_alloc_mask(h)|__GFP_COMP|__GFP_THISNODE|
-						__GFP_REPEAT|__GFP_NOWARN,
+						__GFP_RETRY_MAYFAIL|__GFP_NOWARN,
 		huge_page_order(h));
 	if (page) {
 		prep_new_huge_page(h, page, nid);
@@ -1510,7 +1510,7 @@ static struct page *__hugetlb_alloc_buddy_huge_page(struct hstate *h,
 		struct vm_area_struct *vma, unsigned long addr, int nid)
 {
 	int order = huge_page_order(h);
-	gfp_t gfp = htlb_alloc_mask(h)|__GFP_COMP|__GFP_REPEAT|__GFP_NOWARN;
+	gfp_t gfp = htlb_alloc_mask(h)|__GFP_COMP|__GFP_RETRY_MAYFAIL|__GFP_NOWARN;
 	unsigned int cpuset_mems_cookie;
 
 	/*
diff --git a/mm/internal.h b/mm/internal.h
index 823a7a89099b..8e6d347f70fb 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -23,7 +23,7 @@
  * hints such as HIGHMEM usage.
  */
 #define GFP_RECLAIM_MASK (__GFP_RECLAIM|__GFP_HIGH|__GFP_IO|__GFP_FS|\
-			__GFP_NOWARN|__GFP_REPEAT|__GFP_NOFAIL|\
+			__GFP_NOWARN|__GFP_RETRY_MAYFAIL|__GFP_NOFAIL|\
 			__GFP_NORETRY|__GFP_MEMALLOC|__GFP_NOMEMALLOC|\
 			__GFP_ATOMIC)
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5238b87aec91..bfe4a9bad0f8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3181,6 +3181,14 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	/* The OOM killer will not help higher order allocs */
 	if (order > PAGE_ALLOC_COSTLY_ORDER)
 		goto out;
+	/*
+	 * We have already exhausted all our reclaim opportunities without any
+	 * success so it is time to admit defeat. We will skip the OOM killer
+	 * because it is very likely that the caller has a more reasonable
+	 * fallback than shooting a random task.
+	 */
+	if (gfp_mask & __GFP_RETRY_MAYFAIL)
+		goto out;
 	/* The OOM killer does not needlessly kill tasks for lowmem */
 	if (ac->high_zoneidx < ZONE_NORMAL)
 		goto out;
@@ -3309,7 +3317,7 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
 	}
 
 	/*
-	 * !costly requests are much more important than __GFP_REPEAT
+	 * !costly requests are much more important than __GFP_RETRY_MAYFAIL
 	 * costly ones because they are de facto nofail and invoke OOM
 	 * killer to move on while costly can fail and users are ready
 	 * to cope with that. 1/4 retries is rather arbitrary but we
@@ -3776,9 +3784,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 
 	/*
 	 * Do not retry costly high order allocations unless they are
-	 * __GFP_REPEAT
+	 * __GFP_RETRY_MAYFAIL
 	 */
-	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
+	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_RETRY_MAYFAIL))
 		goto nopage;
 
 	if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 574c67b663fe..b21ba0dfe102 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -56,11 +56,11 @@ void * __meminit vmemmap_alloc_block(unsigned long size, int node)
 
 		if (node_state(node, N_HIGH_MEMORY))
 			page = alloc_pages_node(
-				node, GFP_KERNEL | __GFP_ZERO | __GFP_REPEAT,
+				node, GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL,
 				get_order(size));
 		else
 			page = alloc_pages(
-				GFP_KERNEL | __GFP_ZERO | __GFP_REPEAT,
+				GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL,
 				get_order(size));
 		if (page)
 			return page_address(page);
diff --git a/mm/util.c b/mm/util.c
index 6ed3e49bf1e5..885a78d1941b 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -339,7 +339,7 @@ EXPORT_SYMBOL(vm_mmap);
  * Uses kmalloc to get the memory but if the allocation fails then falls back
  * to the vmalloc allocator. Use kvfree for freeing the memory.
  *
- * Reclaim modifiers - __GFP_NORETRY and __GFP_NOFAIL are not supported. __GFP_REPEAT
+ * Reclaim modifiers - __GFP_NORETRY and __GFP_NOFAIL are not supported. __GFP_RETRY_MAYFAIL
  * is supported only for large (>32kB) allocations, and it should be used only if
  * kmalloc is preferable to the vmalloc fallback, due to visible performance drawbacks.
  *
@@ -364,11 +364,11 @@ void *kvmalloc_node(size_t size, gfp_t flags, int node)
 		kmalloc_flags |= __GFP_NOWARN;
 
 		/*
-		 * We have to override __GFP_REPEAT by __GFP_NORETRY for !costly
+		 * We have to override __GFP_RETRY_MAYFAIL by __GFP_NORETRY for !costly
 		 * requests because there is no other way to tell the allocator
 		 * that we want to fail rather than retry endlessly.
 		 */
-		if (!(kmalloc_flags & __GFP_REPEAT) ||
+		if (!(kmalloc_flags & __GFP_RETRY_MAYFAIL) ||
 				(size <= PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
 			kmalloc_flags |= __GFP_NORETRY;
 	}
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 32979d945766..c2fa2e1b79fc 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1747,7 +1747,7 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
  *	allocator with @gfp_mask flags.  Map them into contiguous
  *	kernel virtual space, using a pagetable protection of @prot.
  *
- *	Reclaim modifiers in @gfp_mask - __GFP_NORETRY, __GFP_REPEAT
+ *	Reclaim modifiers in @gfp_mask - __GFP_NORETRY, __GFP_RETRY_MAYFAIL
  *	and __GFP_NOFAIL are not supported
  *
  *	Any use of gfp flags outside of GFP_KERNEL should be consulted
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4e0a828781e5..8f547176e02c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2435,18 +2435,18 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
 		return false;
 
 	/* Consider stopping depending on scan and reclaim activity */
-	if (sc->gfp_mask & __GFP_REPEAT) {
+	if (sc->gfp_mask & __GFP_RETRY_MAYFAIL) {
 		/*
-		 * For __GFP_REPEAT allocations, stop reclaiming if the
+		 * For __GFP_RETRY_MAYFAIL allocations, stop reclaiming if the
 		 * full LRU list has been scanned and we are still failing
 		 * to reclaim pages. This full LRU scan is potentially
-		 * expensive but a __GFP_REPEAT caller really wants to succeed
+		 * expensive but a __GFP_RETRY_MAYFAIL caller really wants to succeed
 		 */
 		if (!nr_reclaimed && !nr_scanned)
 			return false;
 	} else {
 		/*
-		 * For non-__GFP_REPEAT allocations which can presumably
+		 * For non-__GFP_RETRY_MAYFAIL allocations which can presumably
 		 * fail without consequence, stop if we failed to reclaim
 		 * any pages from the last SWAP_CLUSTER_MAX number of
 		 * pages that were scanned. This will return to the
diff --git a/net/core/dev.c b/net/core/dev.c
index d947308ee255..3e659ac9e0ed 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7121,7 +7121,7 @@ static int netif_alloc_rx_queues(struct net_device *dev)
 
 	BUG_ON(count < 1);
 
-	rx = kvzalloc(sz, GFP_KERNEL | __GFP_REPEAT);
+	rx = kvzalloc(sz, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
 	if (!rx)
 		return -ENOMEM;
 
@@ -7161,7 +7161,7 @@ static int netif_alloc_netdev_queues(struct net_device *dev)
 	if (count < 1 || count > 0xffff)
 		return -EINVAL;
 
-	tx = kvzalloc(sz, GFP_KERNEL | __GFP_REPEAT);
+	tx = kvzalloc(sz, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
 	if (!tx)
 		return -ENOMEM;
 
@@ -7698,7 +7698,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	/* ensure 32-byte alignment of whole construct */
 	alloc_size += NETDEV_ALIGN - 1;
 
-	p = kvzalloc(alloc_size, GFP_KERNEL | __GFP_REPEAT);
+	p = kvzalloc(alloc_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
 	if (!p)
 		return NULL;
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 9ccba86fa23d..26af038e27f0 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4653,7 +4653,7 @@ struct sk_buff *alloc_skb_with_frags(unsigned long header_len,
 
 	gfp_head = gfp_mask;
 	if (gfp_head & __GFP_DIRECT_RECLAIM)
-		gfp_head |= __GFP_REPEAT;
+		gfp_head |= __GFP_RETRY_MAYFAIL;
 
 	*errcode = -ENOBUFS;
 	skb = alloc_skb(header_len, gfp_head);
diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index 594f77d89f6c..daebe062a2dd 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -640,7 +640,7 @@ static int fq_resize(struct Qdisc *sch, u32 log)
 		return 0;
 
 	/* If XPS was setup, we can allocate memory on right NUMA node */
-	array = kvmalloc_node(sizeof(struct rb_root) << log, GFP_KERNEL | __GFP_REPEAT,
+	array = kvmalloc_node(sizeof(struct rb_root) << log, GFP_KERNEL | __GFP_RETRY_MAYFAIL,
 			      netdev_queue_numa_node_read(sch->dev_queue));
 	if (!array)
 		return -ENOMEM;
diff --git a/tools/perf/builtin-kmem.c b/tools/perf/builtin-kmem.c
index 6da8d083e4e5..01ca903fcdb9 100644
--- a/tools/perf/builtin-kmem.c
+++ b/tools/perf/builtin-kmem.c
@@ -638,7 +638,7 @@ static const struct {
 	{ "__GFP_FS",			"F" },
 	{ "__GFP_COLD",			"CO" },
 	{ "__GFP_NOWARN",		"NWR" },
-	{ "__GFP_REPEAT",		"R" },
+	{ "__GFP_RETRY_MAYFAIL",	"R" },
 	{ "__GFP_NOFAIL",		"NF" },
 	{ "__GFP_NORETRY",		"NR" },
 	{ "__GFP_COMP",			"C" },
-- 
2.11.0

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [RFC PATCH 2/4] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic
@ 2017-03-07 15:48   ` Michal Hocko
  0 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2017-03-07 15:48 UTC (permalink / raw)
  To: linux-mm
  Cc: Vlastimil Babka, Johannes Weiner, Mel Gorman, Andrew Morton,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

__GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to
the page allocator. This has been true but only for allocations requests
larger than PAGE_ALLOC_COSTLY_ORDER. It has been always ignored for
smaller sizes. This is a bit unfortunate because there is no way to
express the same semantic for those requests and they are considered too
important to fail so they might end up looping in the page allocator for
ever, similarly to GFP_NOFAIL requests.

Now that the whole tree has been cleaned up and accidental or misled
usage of __GFP_REPEAT flag has been removed for !costly requests we can
give the original flag a better name and more importantly a more useful
semantic. Let's rename it to __GFP_RETRY_MAYFAIL which tells the user that
the allocator would try really hard but there is no promise of a
success. This will work independent of the order and overrides the
default allocator behavior. Page allocator users have several levels of
guarantee vs. cost options (take GFP_KERNEL as an example)
- GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
  attempt to free memory at all. The most light weight mode which even
  doesn't kick the background reclaim. Should be used carefully because
  it might deplete the memory and the next user might hit the more
  aggressive reclaim
- GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT)- optimistic
  allocation without any attempt to free memory from the current context
  but can wake kswapd to reclaim memory if the zone is below the low
  watermark. Can be used from either atomic contexts or when the request
  is a performance optimization and there is another fallback for a slow
  path.
- (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) - non
  sleeping allocation with an expensive fallback so it can access some
  portion of memory reserves. Usually used from interrupt/bh context with
  an expensive slow path fallback.
- GFP_KERNEL - both background and direct reclaim are allowed and the
  _default_ page allocator behavior is used. That means that !costly
  allocation requests are basically nofail (unless the requesting task
  is killed by the OOM killer) and costly will fail early rather than
  cause disruptive reclaim.
- GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior and
  all allocation requests fail early rather than cause disruptive
  reclaim (one round of reclaim in this implementation). The OOM killer
  is not invoked.
- GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator behavior
  and all allocation requests try really hard. The request will fail if the
  reclaim cannot make any progress. The OOM killer won't be triggered.
- GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
  and all allocation requests will loop endlessly until they
  succeed. This might be really dangerous especially for larger orders.

Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL because
they already had their semantic. No new users are added.
__alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if
there is no progress and we have already passed the OOM point. This
means that all the reclaim opportunities have been exhausted except the
most disruptive one (the OOM killer) and a user defined fallback
behavior is more sensible than keep retrying in the page allocator.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 Documentation/DMA-ISA-LPC.txt                |  2 +-
 arch/powerpc/include/asm/book3s/64/pgalloc.h |  2 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c          |  2 +-
 drivers/mmc/host/wbsd.c                      |  2 +-
 drivers/s390/char/vmcp.c                     |  2 +-
 drivers/target/target_core_transport.c       |  2 +-
 drivers/vhost/net.c                          |  2 +-
 drivers/vhost/scsi.c                         |  2 +-
 drivers/vhost/vsock.c                        |  2 +-
 fs/btrfs/check-integrity.c                   |  2 +-
 fs/btrfs/raid56.c                            |  2 +-
 include/linux/gfp.h                          | 32 +++++++++++++++++++---------
 include/linux/slab.h                         |  3 ++-
 include/trace/events/mmflags.h               |  2 +-
 mm/hugetlb.c                                 |  4 ++--
 mm/internal.h                                |  2 +-
 mm/page_alloc.c                              | 14 +++++++++---
 mm/sparse-vmemmap.c                          |  4 ++--
 mm/util.c                                    |  6 +++---
 mm/vmalloc.c                                 |  2 +-
 mm/vmscan.c                                  |  8 +++----
 net/core/dev.c                               |  6 +++---
 net/core/skbuff.c                            |  2 +-
 net/sched/sch_fq.c                           |  2 +-
 tools/perf/builtin-kmem.c                    |  2 +-
 25 files changed, 66 insertions(+), 45 deletions(-)

diff --git a/Documentation/DMA-ISA-LPC.txt b/Documentation/DMA-ISA-LPC.txt
index c41331398752..7a065ac4a9d1 100644
--- a/Documentation/DMA-ISA-LPC.txt
+++ b/Documentation/DMA-ISA-LPC.txt
@@ -42,7 +42,7 @@ requirements you pass the flag GFP_DMA to kmalloc.
 
 Unfortunately the memory available for ISA DMA is scarce so unless you
 allocate the memory during boot-up it's a good idea to also pass
-__GFP_REPEAT and __GFP_NOWARN to make the allocator try a bit harder.
+__GFP_RETRY_MAYFAIL and __GFP_NOWARN to make the allocator try a bit harder.
 
 (This scarcity also means that you should allocate the buffer as
 early as possible and not release it until the driver is unloaded.)
diff --git a/arch/powerpc/include/asm/book3s/64/pgalloc.h b/arch/powerpc/include/asm/book3s/64/pgalloc.h
index cd5e7aa8cc34..1b835aa5b4d1 100644
--- a/arch/powerpc/include/asm/book3s/64/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/64/pgalloc.h
@@ -56,7 +56,7 @@ static inline pgd_t *radix__pgd_alloc(struct mm_struct *mm)
 	return (pgd_t *)__get_free_page(PGALLOC_GFP);
 #else
 	struct page *page;
-	page = alloc_pages(PGALLOC_GFP | __GFP_REPEAT, 4);
+	page = alloc_pages(PGALLOC_GFP | __GFP_RETRY_MAYFAIL, 4);
 	if (!page)
 		return NULL;
 	return (pgd_t *) page_address(page);
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 8c68145ba1bd..8ad2c309f14a 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -93,7 +93,7 @@ int kvmppc_allocate_hpt(struct kvm_hpt_info *info, u32 order)
 	}
 
 	if (!hpt)
-		hpt = __get_free_pages(GFP_KERNEL|__GFP_ZERO|__GFP_REPEAT
+		hpt = __get_free_pages(GFP_KERNEL|__GFP_ZERO|__GFP_RETRY_MAYFAIL
 				       |__GFP_NOWARN, order - PAGE_SHIFT);
 
 	if (!hpt)
diff --git a/drivers/mmc/host/wbsd.c b/drivers/mmc/host/wbsd.c
index bd04e8bae010..b58fa5b5b972 100644
--- a/drivers/mmc/host/wbsd.c
+++ b/drivers/mmc/host/wbsd.c
@@ -1386,7 +1386,7 @@ static void wbsd_request_dma(struct wbsd_host *host, int dma)
 	 * order for ISA to be able to DMA to it.
 	 */
 	host->dma_buffer = kmalloc(WBSD_DMA_SIZE,
-		GFP_NOIO | GFP_DMA | __GFP_REPEAT | __GFP_NOWARN);
+		GFP_NOIO | GFP_DMA | __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
 	if (!host->dma_buffer)
 		goto free;
 
diff --git a/drivers/s390/char/vmcp.c b/drivers/s390/char/vmcp.c
index 65f5a794f26d..98749fa817da 100644
--- a/drivers/s390/char/vmcp.c
+++ b/drivers/s390/char/vmcp.c
@@ -98,7 +98,7 @@ vmcp_write(struct file *file, const char __user *buff, size_t count,
 	}
 	if (!session->response)
 		session->response = (char *)__get_free_pages(GFP_KERNEL
-						| __GFP_REPEAT | GFP_DMA,
+						| __GFP_RETRY_MAYFAIL | GFP_DMA,
 						get_order(session->bufsize));
 	if (!session->response) {
 		mutex_unlock(&session->mutex);
diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c
index 434d9d693989..e585d301c665 100644
--- a/drivers/target/target_core_transport.c
+++ b/drivers/target/target_core_transport.c
@@ -251,7 +251,7 @@ int transport_alloc_session_tags(struct se_session *se_sess,
 	int rc;
 
 	se_sess->sess_cmd_map = kzalloc(tag_num * tag_size,
-					GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+					GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_MAYFAIL);
 	if (!se_sess->sess_cmd_map) {
 		se_sess->sess_cmd_map = vzalloc(tag_num * tag_size);
 		if (!se_sess->sess_cmd_map) {
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index f61f852d6cfd..7d2c4ce6d8d1 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -817,7 +817,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 	struct vhost_virtqueue **vqs;
 	int i;
 
-	n = kvmalloc(sizeof *n, GFP_KERNEL | __GFP_REPEAT);
+	n = kvmalloc(sizeof *n, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
 	if (!n)
 		return -ENOMEM;
 	vqs = kmalloc(VHOST_NET_VQ_MAX * sizeof(*vqs), GFP_KERNEL);
diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index fd6c8b66f06f..ff02a942c4d5 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1404,7 +1404,7 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
 	struct vhost_virtqueue **vqs;
 	int r = -ENOMEM, i;
 
-	vs = kzalloc(sizeof(*vs), GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+	vs = kzalloc(sizeof(*vs), GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_MAYFAIL);
 	if (!vs) {
 		vs = vzalloc(sizeof(*vs));
 		if (!vs)
diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index d403c647ba56..5b76242d73e3 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -460,7 +460,7 @@ static int vhost_vsock_dev_open(struct inode *inode, struct file *file)
 	/* This struct is large and allocation could fail, fall back to vmalloc
 	 * if there is no other way.
 	 */
-	vsock = kvmalloc(sizeof(*vsock), GFP_KERNEL | __GFP_REPEAT);
+	vsock = kvmalloc(sizeof(*vsock), GFP_KERNEL | __GFP_RETRY_MAYFAIL);
 	if (!vsock)
 		return -ENOMEM;
 
diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index ab14c2e635ca..e334ed2b7e64 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -2923,7 +2923,7 @@ int btrfsic_mount(struct btrfs_fs_info *fs_info,
 		       fs_info->sectorsize, PAGE_SIZE);
 		return -1;
 	}
-	state = kzalloc(sizeof(*state), GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+	state = kzalloc(sizeof(*state), GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_MAYFAIL);
 	if (!state) {
 		state = vzalloc(sizeof(*state));
 		if (!state) {
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 1571bf26dc07..94af3db1d0e4 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -218,7 +218,7 @@ int btrfs_alloc_stripe_hash_table(struct btrfs_fs_info *info)
 	 * of a failing mount.
 	 */
 	table_size = sizeof(*table) + sizeof(*h) * num_entries;
-	table = kzalloc(table_size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+	table = kzalloc(table_size, GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_MAYFAIL);
 	if (!table) {
 		table = vzalloc(table_size);
 		if (!table)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 2bfcfd33e476..60af7937c6f2 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -25,7 +25,7 @@ struct vm_area_struct;
 #define ___GFP_FS		0x80u
 #define ___GFP_COLD		0x100u
 #define ___GFP_NOWARN		0x200u
-#define ___GFP_REPEAT		0x400u
+#define ___GFP_RETRY_MAYFAIL		0x400u
 #define ___GFP_NOFAIL		0x800u
 #define ___GFP_NORETRY		0x1000u
 #define ___GFP_MEMALLOC		0x2000u
@@ -136,26 +136,38 @@ struct vm_area_struct;
  *
  * __GFP_RECLAIM is shorthand to allow/forbid both direct and kswapd reclaim.
  *
- * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
- *   _might_ fail.  This depends upon the particular VM implementation.
+ * The default allocator behavior depends on the request size. We have a concept
+ * of so called costly allocations (with order > PAGE_ALLOC_COSTLY_ORDER).
+ * !costly allocations are too essential to fail so they are implicitly
+ * non-failing (with some exceptions like OOM victims might fail) by default while
+ * costly requests try to be not disruptive and back off even without invoking
+ * the OOM killer. The following three modifiers might be used to override some of
+ * these implicit rules
+ *
+ * __GFP_NORETRY: The VM implementation must not retry indefinitely and will
+ *   return NULL when direct reclaim and memory compaction have failed to allow
+ *   the allocation to succeed.  The OOM killer is not called with the current
+ *   implementation. This is a default mode for costly allocations.
+ *
+ * __GFP_RETRY_MAYFAIL: Try hard to allocate the memory, but the allocation attempt
+ *   _might_ fail. All viable forms of memory reclaim are tried before the fail.
+ *   The OOM killer is excluded because this would be too disruptive. This can be
+ *   used to override non-failing default behavior for !costly requests as well as
+ *   fortify costly requests.
  *
  * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
  *   cannot handle allocation failures. New users should be evaluated carefully
  *   (and the flag should be used only when there is no reasonable failure
  *   policy) but it is definitely preferable to use the flag rather than
- *   opencode endless loop around allocator.
- *
- * __GFP_NORETRY: The VM implementation must not retry indefinitely and will
- *   return NULL when direct reclaim and memory compaction have failed to allow
- *   the allocation to succeed.  The OOM killer is not called with the current
- *   implementation.
+ *   opencode endless loop around allocator. Using this flag for costly allocations
+ *   is _highly_ discouraged.
  */
 #define __GFP_IO	((__force gfp_t)___GFP_IO)
 #define __GFP_FS	((__force gfp_t)___GFP_FS)
 #define __GFP_DIRECT_RECLAIM	((__force gfp_t)___GFP_DIRECT_RECLAIM) /* Caller can reclaim */
 #define __GFP_KSWAPD_RECLAIM	((__force gfp_t)___GFP_KSWAPD_RECLAIM) /* kswapd can wake */
 #define __GFP_RECLAIM ((__force gfp_t)(___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM))
-#define __GFP_REPEAT	((__force gfp_t)___GFP_REPEAT)
+#define __GFP_RETRY_MAYFAIL	((__force gfp_t)___GFP_RETRY_MAYFAIL)
 #define __GFP_NOFAIL	((__force gfp_t)___GFP_NOFAIL)
 #define __GFP_NORETRY	((__force gfp_t)___GFP_NORETRY)
 
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 3c37a8c51921..064901ac3e37 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -469,7 +469,8 @@ static __always_inline void *kmalloc_large(size_t size, gfp_t flags)
  *
  * %__GFP_NOWARN - If allocation fails, don't issue any warnings.
  *
- * %__GFP_REPEAT - If allocation fails initially, try once more before failing.
+ * %__GFP_RETRY_MAYFAIL - Try really hard to succeed the allocation but fail
+ *   eventually.
  *
  * There are other flags available as well, but these are not intended
  * for general use, and so are not documented here. For a full list of
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 304ff94363b2..418142a4efce 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -34,7 +34,7 @@
 	{(unsigned long)__GFP_FS,		"__GFP_FS"},		\
 	{(unsigned long)__GFP_COLD,		"__GFP_COLD"},		\
 	{(unsigned long)__GFP_NOWARN,		"__GFP_NOWARN"},	\
-	{(unsigned long)__GFP_REPEAT,		"__GFP_REPEAT"},	\
+	{(unsigned long)__GFP_RETRY_MAYFAIL,	"__GFP_RETRY_MAYFAIL"},	\
 	{(unsigned long)__GFP_NOFAIL,		"__GFP_NOFAIL"},	\
 	{(unsigned long)__GFP_NORETRY,		"__GFP_NORETRY"},	\
 	{(unsigned long)__GFP_COMP,		"__GFP_COMP"},		\
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a7aa811b7d14..dc598bfe4ce9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1369,7 +1369,7 @@ static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
 
 	page = __alloc_pages_node(nid,
 		htlb_alloc_mask(h)|__GFP_COMP|__GFP_THISNODE|
-						__GFP_REPEAT|__GFP_NOWARN,
+						__GFP_RETRY_MAYFAIL|__GFP_NOWARN,
 		huge_page_order(h));
 	if (page) {
 		prep_new_huge_page(h, page, nid);
@@ -1510,7 +1510,7 @@ static struct page *__hugetlb_alloc_buddy_huge_page(struct hstate *h,
 		struct vm_area_struct *vma, unsigned long addr, int nid)
 {
 	int order = huge_page_order(h);
-	gfp_t gfp = htlb_alloc_mask(h)|__GFP_COMP|__GFP_REPEAT|__GFP_NOWARN;
+	gfp_t gfp = htlb_alloc_mask(h)|__GFP_COMP|__GFP_RETRY_MAYFAIL|__GFP_NOWARN;
 	unsigned int cpuset_mems_cookie;
 
 	/*
diff --git a/mm/internal.h b/mm/internal.h
index 823a7a89099b..8e6d347f70fb 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -23,7 +23,7 @@
  * hints such as HIGHMEM usage.
  */
 #define GFP_RECLAIM_MASK (__GFP_RECLAIM|__GFP_HIGH|__GFP_IO|__GFP_FS|\
-			__GFP_NOWARN|__GFP_REPEAT|__GFP_NOFAIL|\
+			__GFP_NOWARN|__GFP_RETRY_MAYFAIL|__GFP_NOFAIL|\
 			__GFP_NORETRY|__GFP_MEMALLOC|__GFP_NOMEMALLOC|\
 			__GFP_ATOMIC)
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5238b87aec91..bfe4a9bad0f8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3181,6 +3181,14 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	/* The OOM killer will not help higher order allocs */
 	if (order > PAGE_ALLOC_COSTLY_ORDER)
 		goto out;
+	/*
+	 * We have already exhausted all our reclaim opportunities without any
+	 * success so it is time to admit defeat. We will skip the OOM killer
+	 * because it is very likely that the caller has a more reasonable
+	 * fallback than shooting a random task.
+	 */
+	if (gfp_mask & __GFP_RETRY_MAYFAIL)
+		goto out;
 	/* The OOM killer does not needlessly kill tasks for lowmem */
 	if (ac->high_zoneidx < ZONE_NORMAL)
 		goto out;
@@ -3309,7 +3317,7 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
 	}
 
 	/*
-	 * !costly requests are much more important than __GFP_REPEAT
+	 * !costly requests are much more important than __GFP_RETRY_MAYFAIL
 	 * costly ones because they are de facto nofail and invoke OOM
 	 * killer to move on while costly can fail and users are ready
 	 * to cope with that. 1/4 retries is rather arbitrary but we
@@ -3776,9 +3784,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 
 	/*
 	 * Do not retry costly high order allocations unless they are
-	 * __GFP_REPEAT
+	 * __GFP_RETRY_MAYFAIL
 	 */
-	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
+	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_RETRY_MAYFAIL))
 		goto nopage;
 
 	if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 574c67b663fe..b21ba0dfe102 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -56,11 +56,11 @@ void * __meminit vmemmap_alloc_block(unsigned long size, int node)
 
 		if (node_state(node, N_HIGH_MEMORY))
 			page = alloc_pages_node(
-				node, GFP_KERNEL | __GFP_ZERO | __GFP_REPEAT,
+				node, GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL,
 				get_order(size));
 		else
 			page = alloc_pages(
-				GFP_KERNEL | __GFP_ZERO | __GFP_REPEAT,
+				GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL,
 				get_order(size));
 		if (page)
 			return page_address(page);
diff --git a/mm/util.c b/mm/util.c
index 6ed3e49bf1e5..885a78d1941b 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -339,7 +339,7 @@ EXPORT_SYMBOL(vm_mmap);
  * Uses kmalloc to get the memory but if the allocation fails then falls back
  * to the vmalloc allocator. Use kvfree for freeing the memory.
  *
- * Reclaim modifiers - __GFP_NORETRY and __GFP_NOFAIL are not supported. __GFP_REPEAT
+ * Reclaim modifiers - __GFP_NORETRY and __GFP_NOFAIL are not supported. __GFP_RETRY_MAYFAIL
  * is supported only for large (>32kB) allocations, and it should be used only if
  * kmalloc is preferable to the vmalloc fallback, due to visible performance drawbacks.
  *
@@ -364,11 +364,11 @@ void *kvmalloc_node(size_t size, gfp_t flags, int node)
 		kmalloc_flags |= __GFP_NOWARN;
 
 		/*
-		 * We have to override __GFP_REPEAT by __GFP_NORETRY for !costly
+		 * We have to override __GFP_RETRY_MAYFAIL by __GFP_NORETRY for !costly
 		 * requests because there is no other way to tell the allocator
 		 * that we want to fail rather than retry endlessly.
 		 */
-		if (!(kmalloc_flags & __GFP_REPEAT) ||
+		if (!(kmalloc_flags & __GFP_RETRY_MAYFAIL) ||
 				(size <= PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
 			kmalloc_flags |= __GFP_NORETRY;
 	}
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 32979d945766..c2fa2e1b79fc 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1747,7 +1747,7 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
  *	allocator with @gfp_mask flags.  Map them into contiguous
  *	kernel virtual space, using a pagetable protection of @prot.
  *
- *	Reclaim modifiers in @gfp_mask - __GFP_NORETRY, __GFP_REPEAT
+ *	Reclaim modifiers in @gfp_mask - __GFP_NORETRY, __GFP_RETRY_MAYFAIL
  *	and __GFP_NOFAIL are not supported
  *
  *	Any use of gfp flags outside of GFP_KERNEL should be consulted
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4e0a828781e5..8f547176e02c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2435,18 +2435,18 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
 		return false;
 
 	/* Consider stopping depending on scan and reclaim activity */
-	if (sc->gfp_mask & __GFP_REPEAT) {
+	if (sc->gfp_mask & __GFP_RETRY_MAYFAIL) {
 		/*
-		 * For __GFP_REPEAT allocations, stop reclaiming if the
+		 * For __GFP_RETRY_MAYFAIL allocations, stop reclaiming if the
 		 * full LRU list has been scanned and we are still failing
 		 * to reclaim pages. This full LRU scan is potentially
-		 * expensive but a __GFP_REPEAT caller really wants to succeed
+		 * expensive but a __GFP_RETRY_MAYFAIL caller really wants to succeed
 		 */
 		if (!nr_reclaimed && !nr_scanned)
 			return false;
 	} else {
 		/*
-		 * For non-__GFP_REPEAT allocations which can presumably
+		 * For non-__GFP_RETRY_MAYFAIL allocations which can presumably
 		 * fail without consequence, stop if we failed to reclaim
 		 * any pages from the last SWAP_CLUSTER_MAX number of
 		 * pages that were scanned. This will return to the
diff --git a/net/core/dev.c b/net/core/dev.c
index d947308ee255..3e659ac9e0ed 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7121,7 +7121,7 @@ static int netif_alloc_rx_queues(struct net_device *dev)
 
 	BUG_ON(count < 1);
 
-	rx = kvzalloc(sz, GFP_KERNEL | __GFP_REPEAT);
+	rx = kvzalloc(sz, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
 	if (!rx)
 		return -ENOMEM;
 
@@ -7161,7 +7161,7 @@ static int netif_alloc_netdev_queues(struct net_device *dev)
 	if (count < 1 || count > 0xffff)
 		return -EINVAL;
 
-	tx = kvzalloc(sz, GFP_KERNEL | __GFP_REPEAT);
+	tx = kvzalloc(sz, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
 	if (!tx)
 		return -ENOMEM;
 
@@ -7698,7 +7698,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	/* ensure 32-byte alignment of whole construct */
 	alloc_size += NETDEV_ALIGN - 1;
 
-	p = kvzalloc(alloc_size, GFP_KERNEL | __GFP_REPEAT);
+	p = kvzalloc(alloc_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
 	if (!p)
 		return NULL;
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 9ccba86fa23d..26af038e27f0 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4653,7 +4653,7 @@ struct sk_buff *alloc_skb_with_frags(unsigned long header_len,
 
 	gfp_head = gfp_mask;
 	if (gfp_head & __GFP_DIRECT_RECLAIM)
-		gfp_head |= __GFP_REPEAT;
+		gfp_head |= __GFP_RETRY_MAYFAIL;
 
 	*errcode = -ENOBUFS;
 	skb = alloc_skb(header_len, gfp_head);
diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index 594f77d89f6c..daebe062a2dd 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -640,7 +640,7 @@ static int fq_resize(struct Qdisc *sch, u32 log)
 		return 0;
 
 	/* If XPS was setup, we can allocate memory on right NUMA node */
-	array = kvmalloc_node(sizeof(struct rb_root) << log, GFP_KERNEL | __GFP_REPEAT,
+	array = kvmalloc_node(sizeof(struct rb_root) << log, GFP_KERNEL | __GFP_RETRY_MAYFAIL,
 			      netdev_queue_numa_node_read(sch->dev_queue));
 	if (!array)
 		return -ENOMEM;
diff --git a/tools/perf/builtin-kmem.c b/tools/perf/builtin-kmem.c
index 6da8d083e4e5..01ca903fcdb9 100644
--- a/tools/perf/builtin-kmem.c
+++ b/tools/perf/builtin-kmem.c
@@ -638,7 +638,7 @@ static const struct {
 	{ "__GFP_FS",			"F" },
 	{ "__GFP_COLD",			"CO" },
 	{ "__GFP_NOWARN",		"NWR" },
-	{ "__GFP_REPEAT",		"R" },
+	{ "__GFP_RETRY_MAYFAIL",	"R" },
 	{ "__GFP_NOFAIL",		"NF" },
 	{ "__GFP_NORETRY",		"NR" },
 	{ "__GFP_COMP",			"C" },
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [RFC PATCH 3/4] xfs: map KM_MAYFAIL to __GFP_RETRY_MAYFAIL
  2017-03-07 15:48 ` Michal Hocko
@ 2017-03-07 15:48   ` Michal Hocko
  -1 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2017-03-07 15:48 UTC (permalink / raw)
  To: linux-mm
  Cc: Vlastimil Babka, Johannes Weiner, Mel Gorman, Andrew Morton,
	LKML, Michal Hocko, Darrick J. Wong

From: Michal Hocko <mhocko@suse.com>

KM_MAYFAIL didn't have any suitable GFP_FOO counterpart until recently
so it relied on the default page allocator behavior for the given set
of flags. This means that small allocations actually never failed.

Now that we have __GFP_RETRY_MAYFAIL flag which works independently on the
allocation request size we can map KM_MAYFAIL to it. The allocator will
try as hard as it can to fulfill the request but fails eventually if
the progress cannot be made.

Cc: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/xfs/kmem.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index ae08cfd9552a..ac80a4855c83 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -54,6 +54,16 @@ kmem_flags_convert(xfs_km_flags_t flags)
 			lflags &= ~__GFP_FS;
 	}
 
+	/*
+	 * Default page/slab allocator behavior is to retry for ever
+	 * for small allocations. We can override this behavior by using
+	 * __GFP_RETRY_MAYFAIL which will tell the allocator to retry as long
+	 * as it is feasible but rather fail than retry for ever for all
+	 * request sizes.
+	 */
+	if (flags & KM_MAYFAIL)
+		lflags |= __GFP_RETRY_MAYFAIL;
+
 	if (flags & KM_ZERO)
 		lflags |= __GFP_ZERO;
 
-- 
2.11.0

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [RFC PATCH 3/4] xfs: map KM_MAYFAIL to __GFP_RETRY_MAYFAIL
@ 2017-03-07 15:48   ` Michal Hocko
  0 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2017-03-07 15:48 UTC (permalink / raw)
  To: linux-mm
  Cc: Vlastimil Babka, Johannes Weiner, Mel Gorman, Andrew Morton,
	LKML, Michal Hocko, Darrick J. Wong

From: Michal Hocko <mhocko@suse.com>

KM_MAYFAIL didn't have any suitable GFP_FOO counterpart until recently
so it relied on the default page allocator behavior for the given set
of flags. This means that small allocations actually never failed.

Now that we have __GFP_RETRY_MAYFAIL flag which works independently on the
allocation request size we can map KM_MAYFAIL to it. The allocator will
try as hard as it can to fulfill the request but fails eventually if
the progress cannot be made.

Cc: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/xfs/kmem.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index ae08cfd9552a..ac80a4855c83 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -54,6 +54,16 @@ kmem_flags_convert(xfs_km_flags_t flags)
 			lflags &= ~__GFP_FS;
 	}
 
+	/*
+	 * Default page/slab allocator behavior is to retry for ever
+	 * for small allocations. We can override this behavior by using
+	 * __GFP_RETRY_MAYFAIL which will tell the allocator to retry as long
+	 * as it is feasible but rather fail than retry for ever for all
+	 * request sizes.
+	 */
+	if (flags & KM_MAYFAIL)
+		lflags |= __GFP_RETRY_MAYFAIL;
+
 	if (flags & KM_ZERO)
 		lflags |= __GFP_ZERO;
 
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [RFC PATCH 4/4] mm: kvmalloc support __GFP_RETRY_MAYFAIL for all sizes
  2017-03-07 15:48 ` Michal Hocko
@ 2017-03-07 15:48   ` Michal Hocko
  -1 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2017-03-07 15:48 UTC (permalink / raw)
  To: linux-mm
  Cc: Vlastimil Babka, Johannes Weiner, Mel Gorman, Andrew Morton,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Now that __GFP_RETRY_MAYFAIL has a reasonable semantic regardless of the
request size we can drop the hackish implementation for !costly orders.
__GFP_RETRY_MAYFAIL retries as long as the reclaim makes a forward
progress and backs of when we are out of memory for the requested size.
Therefore we do not need to enforce__GFP_NORETRY for !costly orders just
to silent the oom killer anymore.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/util.c | 14 ++++----------
 1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/mm/util.c b/mm/util.c
index 885a78d1941b..359d80333301 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -339,9 +339,9 @@ EXPORT_SYMBOL(vm_mmap);
  * Uses kmalloc to get the memory but if the allocation fails then falls back
  * to the vmalloc allocator. Use kvfree for freeing the memory.
  *
- * Reclaim modifiers - __GFP_NORETRY and __GFP_NOFAIL are not supported. __GFP_RETRY_MAYFAIL
- * is supported only for large (>32kB) allocations, and it should be used only if
- * kmalloc is preferable to the vmalloc fallback, due to visible performance drawbacks.
+ * Reclaim modifiers - __GFP_NORETRY and __GFP_NOFAIL are not supported.
+ * __GFP_RETRY_MAYFAIL is supported, and it should be used only if kmalloc is
+ * preferable to the vmalloc fallback, due to visible performance drawbacks.
  *
  * Any use of gfp flags outside of GFP_KERNEL should be consulted with mm people.
  */
@@ -363,13 +363,7 @@ void *kvmalloc_node(size_t size, gfp_t flags, int node)
 	if (size > PAGE_SIZE) {
 		kmalloc_flags |= __GFP_NOWARN;
 
-		/*
-		 * We have to override __GFP_RETRY_MAYFAIL by __GFP_NORETRY for !costly
-		 * requests because there is no other way to tell the allocator
-		 * that we want to fail rather than retry endlessly.
-		 */
-		if (!(kmalloc_flags & __GFP_RETRY_MAYFAIL) ||
-				(size <= PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
+		if (!(kmalloc_flags & __GFP_RETRY_MAYFAIL))
 			kmalloc_flags |= __GFP_NORETRY;
 	}
 
-- 
2.11.0

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [RFC PATCH 4/4] mm: kvmalloc support __GFP_RETRY_MAYFAIL for all sizes
@ 2017-03-07 15:48   ` Michal Hocko
  0 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2017-03-07 15:48 UTC (permalink / raw)
  To: linux-mm
  Cc: Vlastimil Babka, Johannes Weiner, Mel Gorman, Andrew Morton,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Now that __GFP_RETRY_MAYFAIL has a reasonable semantic regardless of the
request size we can drop the hackish implementation for !costly orders.
__GFP_RETRY_MAYFAIL retries as long as the reclaim makes a forward
progress and backs of when we are out of memory for the requested size.
Therefore we do not need to enforce__GFP_NORETRY for !costly orders just
to silent the oom killer anymore.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/util.c | 14 ++++----------
 1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/mm/util.c b/mm/util.c
index 885a78d1941b..359d80333301 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -339,9 +339,9 @@ EXPORT_SYMBOL(vm_mmap);
  * Uses kmalloc to get the memory but if the allocation fails then falls back
  * to the vmalloc allocator. Use kvfree for freeing the memory.
  *
- * Reclaim modifiers - __GFP_NORETRY and __GFP_NOFAIL are not supported. __GFP_RETRY_MAYFAIL
- * is supported only for large (>32kB) allocations, and it should be used only if
- * kmalloc is preferable to the vmalloc fallback, due to visible performance drawbacks.
+ * Reclaim modifiers - __GFP_NORETRY and __GFP_NOFAIL are not supported.
+ * __GFP_RETRY_MAYFAIL is supported, and it should be used only if kmalloc is
+ * preferable to the vmalloc fallback, due to visible performance drawbacks.
  *
  * Any use of gfp flags outside of GFP_KERNEL should be consulted with mm people.
  */
@@ -363,13 +363,7 @@ void *kvmalloc_node(size_t size, gfp_t flags, int node)
 	if (size > PAGE_SIZE) {
 		kmalloc_flags |= __GFP_NOWARN;
 
-		/*
-		 * We have to override __GFP_RETRY_MAYFAIL by __GFP_NORETRY for !costly
-		 * requests because there is no other way to tell the allocator
-		 * that we want to fail rather than retry endlessly.
-		 */
-		if (!(kmalloc_flags & __GFP_RETRY_MAYFAIL) ||
-				(size <= PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
+		if (!(kmalloc_flags & __GFP_RETRY_MAYFAIL))
 			kmalloc_flags |= __GFP_NORETRY;
 	}
 
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 3/4] xfs: map KM_MAYFAIL to __GFP_RETRY_MAYFAIL
  2017-03-07 15:48   ` Michal Hocko
@ 2017-03-07 17:05     ` Darrick J. Wong
  -1 siblings, 0 replies; 47+ messages in thread
From: Darrick J. Wong @ 2017-03-07 17:05 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Vlastimil Babka, Johannes Weiner, Mel Gorman,
	Andrew Morton, LKML, Michal Hocko

On Tue, Mar 07, 2017 at 04:48:42PM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> KM_MAYFAIL didn't have any suitable GFP_FOO counterpart until recently
> so it relied on the default page allocator behavior for the given set
> of flags. This means that small allocations actually never failed.
> 
> Now that we have __GFP_RETRY_MAYFAIL flag which works independently on the
> allocation request size we can map KM_MAYFAIL to it. The allocator will
> try as hard as it can to fulfill the request but fails eventually if
> the progress cannot be made.
> 
> Cc: Darrick J. Wong <darrick.wong@oracle.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  fs/xfs/kmem.h | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
> index ae08cfd9552a..ac80a4855c83 100644
> --- a/fs/xfs/kmem.h
> +++ b/fs/xfs/kmem.h
> @@ -54,6 +54,16 @@ kmem_flags_convert(xfs_km_flags_t flags)
>  			lflags &= ~__GFP_FS;
>  	}
>  
> +	/*
> +	 * Default page/slab allocator behavior is to retry for ever
> +	 * for small allocations. We can override this behavior by using
> +	 * __GFP_RETRY_MAYFAIL which will tell the allocator to retry as long
> +	 * as it is feasible but rather fail than retry for ever for all

s/for ever/forever/

> +	 * request sizes.
> +	 */
> +	if (flags & KM_MAYFAIL)
> +		lflags |= __GFP_RETRY_MAYFAIL;

But otherwise seems ok from a quick grep -B5 MAYFAIL through the XFS code.

(Has this been tested anywhere?)

--D

> +
>  	if (flags & KM_ZERO)
>  		lflags |= __GFP_ZERO;
>  
> -- 
> 2.11.0
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 3/4] xfs: map KM_MAYFAIL to __GFP_RETRY_MAYFAIL
@ 2017-03-07 17:05     ` Darrick J. Wong
  0 siblings, 0 replies; 47+ messages in thread
From: Darrick J. Wong @ 2017-03-07 17:05 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Vlastimil Babka, Johannes Weiner, Mel Gorman,
	Andrew Morton, LKML, Michal Hocko

On Tue, Mar 07, 2017 at 04:48:42PM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> KM_MAYFAIL didn't have any suitable GFP_FOO counterpart until recently
> so it relied on the default page allocator behavior for the given set
> of flags. This means that small allocations actually never failed.
> 
> Now that we have __GFP_RETRY_MAYFAIL flag which works independently on the
> allocation request size we can map KM_MAYFAIL to it. The allocator will
> try as hard as it can to fulfill the request but fails eventually if
> the progress cannot be made.
> 
> Cc: Darrick J. Wong <darrick.wong@oracle.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  fs/xfs/kmem.h | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
> index ae08cfd9552a..ac80a4855c83 100644
> --- a/fs/xfs/kmem.h
> +++ b/fs/xfs/kmem.h
> @@ -54,6 +54,16 @@ kmem_flags_convert(xfs_km_flags_t flags)
>  			lflags &= ~__GFP_FS;
>  	}
>  
> +	/*
> +	 * Default page/slab allocator behavior is to retry for ever
> +	 * for small allocations. We can override this behavior by using
> +	 * __GFP_RETRY_MAYFAIL which will tell the allocator to retry as long
> +	 * as it is feasible but rather fail than retry for ever for all

s/for ever/forever/

> +	 * request sizes.
> +	 */
> +	if (flags & KM_MAYFAIL)
> +		lflags |= __GFP_RETRY_MAYFAIL;

But otherwise seems ok from a quick grep -B5 MAYFAIL through the XFS code.

(Has this been tested anywhere?)

--D

> +
>  	if (flags & KM_ZERO)
>  		lflags |= __GFP_ZERO;
>  
> -- 
> 2.11.0
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/4] s390: get rid of superfluous __GFP_REPEAT
  2017-03-07 15:48   ` Michal Hocko
@ 2017-03-08  8:23     ` Heiko Carstens
  -1 siblings, 0 replies; 47+ messages in thread
From: Heiko Carstens @ 2017-03-08  8:23 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Vlastimil Babka, Johannes Weiner, Mel Gorman,
	Andrew Morton, LKML, Michal Hocko

On Tue, Mar 07, 2017 at 04:48:40PM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> __GFP_REPEAT has a rather weak semantic but since it has been introduced
> around 2.6.12 it has been ignored for low order allocations.
> 
> page_table_alloc then uses the flag for a single page allocation. This
> means that this flag has never been actually useful here because it has
> always been used only for PAGE_ALLOC_COSTLY requests.
> 
> An earlier attempt to remove the flag 10d58bf297e2 ("s390: get rid of
> superfluous __GFP_REPEAT") has missed this one but the situation is very
> same here.
> 
> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  arch/s390/mm/pgalloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

FWIW:
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>

If you want, this can be routed via the s390 tree, whatever you prefer.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/4] s390: get rid of superfluous __GFP_REPEAT
@ 2017-03-08  8:23     ` Heiko Carstens
  0 siblings, 0 replies; 47+ messages in thread
From: Heiko Carstens @ 2017-03-08  8:23 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Vlastimil Babka, Johannes Weiner, Mel Gorman,
	Andrew Morton, LKML, Michal Hocko

On Tue, Mar 07, 2017 at 04:48:40PM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> __GFP_REPEAT has a rather weak semantic but since it has been introduced
> around 2.6.12 it has been ignored for low order allocations.
> 
> page_table_alloc then uses the flag for a single page allocation. This
> means that this flag has never been actually useful here because it has
> always been used only for PAGE_ALLOC_COSTLY requests.
> 
> An earlier attempt to remove the flag 10d58bf297e2 ("s390: get rid of
> superfluous __GFP_REPEAT") has missed this one but the situation is very
> same here.
> 
> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  arch/s390/mm/pgalloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

FWIW:
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>

If you want, this can be routed via the s390 tree, whatever you prefer.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 3/4] xfs: map KM_MAYFAIL to __GFP_RETRY_MAYFAIL
  2017-03-07 17:05     ` Darrick J. Wong
@ 2017-03-08  9:35       ` Michal Hocko
  -1 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2017-03-08  9:35 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-mm, Vlastimil Babka, Johannes Weiner, Mel Gorman,
	Andrew Morton, LKML

On Tue 07-03-17 09:05:19, Darrick J. Wong wrote:
> On Tue, Mar 07, 2017 at 04:48:42PM +0100, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > KM_MAYFAIL didn't have any suitable GFP_FOO counterpart until recently
> > so it relied on the default page allocator behavior for the given set
> > of flags. This means that small allocations actually never failed.
> > 
> > Now that we have __GFP_RETRY_MAYFAIL flag which works independently on the
> > allocation request size we can map KM_MAYFAIL to it. The allocator will
> > try as hard as it can to fulfill the request but fails eventually if
> > the progress cannot be made.
> > 
> > Cc: Darrick J. Wong <darrick.wong@oracle.com>
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > ---
> >  fs/xfs/kmem.h | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> > 
> > diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
> > index ae08cfd9552a..ac80a4855c83 100644
> > --- a/fs/xfs/kmem.h
> > +++ b/fs/xfs/kmem.h
> > @@ -54,6 +54,16 @@ kmem_flags_convert(xfs_km_flags_t flags)
> >  			lflags &= ~__GFP_FS;
> >  	}
> >  
> > +	/*
> > +	 * Default page/slab allocator behavior is to retry for ever
> > +	 * for small allocations. We can override this behavior by using
> > +	 * __GFP_RETRY_MAYFAIL which will tell the allocator to retry as long
> > +	 * as it is feasible but rather fail than retry for ever for all
> 
> s/for ever/forever/

fixed

> 
> > +	 * request sizes.
> > +	 */
> > +	if (flags & KM_MAYFAIL)
> > +		lflags |= __GFP_RETRY_MAYFAIL;
> 
> But otherwise seems ok from a quick grep -B5 MAYFAIL through the XFS code.
> 
> (Has this been tested anywhere?)

not yet, this is more for a discussion at this stage. I plan to run it
through xfstests once we agree on the proper semantic. I have to confess
I rely on the proper KM_MAYFAIL annotations here, though.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 3/4] xfs: map KM_MAYFAIL to __GFP_RETRY_MAYFAIL
@ 2017-03-08  9:35       ` Michal Hocko
  0 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2017-03-08  9:35 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-mm, Vlastimil Babka, Johannes Weiner, Mel Gorman,
	Andrew Morton, LKML

On Tue 07-03-17 09:05:19, Darrick J. Wong wrote:
> On Tue, Mar 07, 2017 at 04:48:42PM +0100, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > KM_MAYFAIL didn't have any suitable GFP_FOO counterpart until recently
> > so it relied on the default page allocator behavior for the given set
> > of flags. This means that small allocations actually never failed.
> > 
> > Now that we have __GFP_RETRY_MAYFAIL flag which works independently on the
> > allocation request size we can map KM_MAYFAIL to it. The allocator will
> > try as hard as it can to fulfill the request but fails eventually if
> > the progress cannot be made.
> > 
> > Cc: Darrick J. Wong <darrick.wong@oracle.com>
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > ---
> >  fs/xfs/kmem.h | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> > 
> > diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
> > index ae08cfd9552a..ac80a4855c83 100644
> > --- a/fs/xfs/kmem.h
> > +++ b/fs/xfs/kmem.h
> > @@ -54,6 +54,16 @@ kmem_flags_convert(xfs_km_flags_t flags)
> >  			lflags &= ~__GFP_FS;
> >  	}
> >  
> > +	/*
> > +	 * Default page/slab allocator behavior is to retry for ever
> > +	 * for small allocations. We can override this behavior by using
> > +	 * __GFP_RETRY_MAYFAIL which will tell the allocator to retry as long
> > +	 * as it is feasible but rather fail than retry for ever for all
> 
> s/for ever/forever/

fixed

> 
> > +	 * request sizes.
> > +	 */
> > +	if (flags & KM_MAYFAIL)
> > +		lflags |= __GFP_RETRY_MAYFAIL;
> 
> But otherwise seems ok from a quick grep -B5 MAYFAIL through the XFS code.
> 
> (Has this been tested anywhere?)

not yet, this is more for a discussion at this stage. I plan to run it
through xfstests once we agree on the proper semantic. I have to confess
I rely on the proper KM_MAYFAIL annotations here, though.

Thanks!
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 3/4] xfs: map KM_MAYFAIL to __GFP_RETRY_MAYFAIL
  2017-03-07 15:48   ` Michal Hocko
@ 2017-03-08 11:23     ` Tetsuo Handa
  -1 siblings, 0 replies; 47+ messages in thread
From: Tetsuo Handa @ 2017-03-08 11:23 UTC (permalink / raw)
  To: Michal Hocko, linux-mm
  Cc: Vlastimil Babka, Johannes Weiner, Mel Gorman, Andrew Morton,
	LKML, Michal Hocko, Darrick J. Wong

On 2017/03/08 0:48, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> KM_MAYFAIL didn't have any suitable GFP_FOO counterpart until recently
> so it relied on the default page allocator behavior for the given set
> of flags. This means that small allocations actually never failed.
> 
> Now that we have __GFP_RETRY_MAYFAIL flag which works independently on the
> allocation request size we can map KM_MAYFAIL to it. The allocator will
> try as hard as it can to fulfill the request but fails eventually if
> the progress cannot be made.
> 
> Cc: Darrick J. Wong <darrick.wong@oracle.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  fs/xfs/kmem.h | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
> index ae08cfd9552a..ac80a4855c83 100644
> --- a/fs/xfs/kmem.h
> +++ b/fs/xfs/kmem.h
> @@ -54,6 +54,16 @@ kmem_flags_convert(xfs_km_flags_t flags)
>  			lflags &= ~__GFP_FS;
>  	}
>  
> +	/*
> +	 * Default page/slab allocator behavior is to retry for ever
> +	 * for small allocations. We can override this behavior by using
> +	 * __GFP_RETRY_MAYFAIL which will tell the allocator to retry as long
> +	 * as it is feasible but rather fail than retry for ever for all
> +	 * request sizes.
> +	 */
> +	if (flags & KM_MAYFAIL)
> +		lflags |= __GFP_RETRY_MAYFAIL;

I don't see advantages of supporting both __GFP_NORETRY and __GFP_RETRY_MAYFAIL.
kmem_flags_convert() can always set __GFP_NORETRY because the callers use
opencoded __GFP_NOFAIL loop (with possible allocation lockup warning) unless
KM_MAYFAIL is set.

> +
>  	if (flags & KM_ZERO)
>  		lflags |= __GFP_ZERO;
>  
> 

Well, commit 9a67f6488eca926f ("mm: consolidate GFP_NOFAIL checks in the
allocator slowpath") unexpectedly changed to always give up without using
memory reserves (unless __GFP_NOFAIL is set) if TIF_MEMDIE is set to current
thread when current thread is inside __alloc_pages_may_oom() (precisely speaking,
if TIF_MEMDIE is set when current thread is after

        if (gfp_pfmemalloc_allowed(gfp_mask))
                alloc_flags = ALLOC_NO_WATERMARKS;

line and before

        /* Avoid allocations with no watermarks from looping endlessly */
        if (test_thread_flag(TIF_MEMDIE))
                goto nopage;

line, which is likely always true); but this is off-topic for this thread.

The lines which are executed only when __GFP_RETRY_MAYFAIL is set rather than
__GFP_NORETRY is set are

        /* Do not loop if specifically requested */
        if (gfp_mask & __GFP_NORETRY)
                goto nopage;

        /*
         * Do not retry costly high order allocations unless they are
         * __GFP_RETRY_MAYFAIL
         */
        if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_RETRY_MAYFAIL))
                goto nopage;

        if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
                                 did_some_progress > 0, &no_progress_loops))
                goto retry;

        /*
         * It doesn't make any sense to retry for the compaction if the order-0
         * reclaim is not able to make any progress because the current
         * implementation of the compaction depends on the sufficient amount
         * of free memory (see __compaction_suitable)
         */
        if (did_some_progress > 0 &&
                        should_compact_retry(ac, order, alloc_flags,
                                compact_result, &compact_priority,
                                &compaction_retries))
                goto retry;

        /*
         * It's possible we raced with cpuset update so the OOM would be
         * premature (see below the nopage: label for full explanation).
         */
        if (read_mems_allowed_retry(cpuset_mems_cookie))
                goto retry_cpuset;

        /* Reclaim has failed us, start killing things */
        page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
        if (page)
                goto got_pg;

__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
        const struct alloc_context *ac, unsigned long *did_some_progress)
{
        struct oom_control oc = {
                .zonelist = ac->zonelist,
                .nodemask = ac->nodemask,
                .memcg = NULL,
                .gfp_mask = gfp_mask,
                .order = order,
        };
        struct page *page;

        *did_some_progress = 0;

        /*
         * Acquire the oom lock.  If that fails, somebody else is
         * making progress for us.
         */
        if (!mutex_trylock(&oom_lock)) {
                *did_some_progress = 1;
                schedule_timeout_uninterruptible(1);
                return NULL;
        }

        /*
         * Go through the zonelist yet one more time, keep very high watermark
         * here, this is only to catch a parallel oom killing, we must fail if
         * we're still under heavy pressure.
         */
        page = get_page_from_freelist(gfp_mask | __GFP_HARDWALL, order,
                                        ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);
        if (page)
                goto out;

        /* Coredumps can quickly deplete all memory reserves */
        if (current->flags & PF_DUMPCORE)
                goto out;
        /* The OOM killer will not help higher order allocs */
        if (order > PAGE_ALLOC_COSTLY_ORDER)
                goto out;
        /*
         * We have already exhausted all our reclaim opportunities without any
         * success so it is time to admit defeat. We will skip the OOM killer
         * because it is very likely that the caller has a more reasonable
         * fallback than shooting a random task.
         */
        if (gfp_mask & __GFP_RETRY_MAYFAIL)
                goto out;

where both __GFP_NORETRY and __GFP_RETRY_MAYFAIL are checked after
direct reclaim and compaction failed. __GFP_RETRY_MAYFAIL optimistically
retries based on one of should_reclaim_retry() or should_compact_retry()
or read_mems_allowed_retry() returns true or mutex_trylock(&oom_lock) in
__alloc_pages_may_oom() returns 0. If !__GFP_FS allocation requests are
holding oom_lock each other, __GFP_RETRY_MAYFAIL allocation requests (which
are likely !__GFP_FS allocation requests due to __GFP_FS allocation requests
being blocked on direct reclaim) can be blocked for uncontrollable duration
without making progress. It seems to me that the difference between
__GFP_NORETRY and __GFP_RETRY_MAYFAIL is not useful. Rather, the caller can
set __GFP_NORETRY and retry with any control (e.g. set __GFP_HIGH upon first
timeout, give up upon second timeout).

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 3/4] xfs: map KM_MAYFAIL to __GFP_RETRY_MAYFAIL
@ 2017-03-08 11:23     ` Tetsuo Handa
  0 siblings, 0 replies; 47+ messages in thread
From: Tetsuo Handa @ 2017-03-08 11:23 UTC (permalink / raw)
  To: Michal Hocko, linux-mm
  Cc: Vlastimil Babka, Johannes Weiner, Mel Gorman, Andrew Morton,
	LKML, Michal Hocko, Darrick J. Wong

On 2017/03/08 0:48, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> KM_MAYFAIL didn't have any suitable GFP_FOO counterpart until recently
> so it relied on the default page allocator behavior for the given set
> of flags. This means that small allocations actually never failed.
> 
> Now that we have __GFP_RETRY_MAYFAIL flag which works independently on the
> allocation request size we can map KM_MAYFAIL to it. The allocator will
> try as hard as it can to fulfill the request but fails eventually if
> the progress cannot be made.
> 
> Cc: Darrick J. Wong <darrick.wong@oracle.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  fs/xfs/kmem.h | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
> index ae08cfd9552a..ac80a4855c83 100644
> --- a/fs/xfs/kmem.h
> +++ b/fs/xfs/kmem.h
> @@ -54,6 +54,16 @@ kmem_flags_convert(xfs_km_flags_t flags)
>  			lflags &= ~__GFP_FS;
>  	}
>  
> +	/*
> +	 * Default page/slab allocator behavior is to retry for ever
> +	 * for small allocations. We can override this behavior by using
> +	 * __GFP_RETRY_MAYFAIL which will tell the allocator to retry as long
> +	 * as it is feasible but rather fail than retry for ever for all
> +	 * request sizes.
> +	 */
> +	if (flags & KM_MAYFAIL)
> +		lflags |= __GFP_RETRY_MAYFAIL;

I don't see advantages of supporting both __GFP_NORETRY and __GFP_RETRY_MAYFAIL.
kmem_flags_convert() can always set __GFP_NORETRY because the callers use
opencoded __GFP_NOFAIL loop (with possible allocation lockup warning) unless
KM_MAYFAIL is set.

> +
>  	if (flags & KM_ZERO)
>  		lflags |= __GFP_ZERO;
>  
> 

Well, commit 9a67f6488eca926f ("mm: consolidate GFP_NOFAIL checks in the
allocator slowpath") unexpectedly changed to always give up without using
memory reserves (unless __GFP_NOFAIL is set) if TIF_MEMDIE is set to current
thread when current thread is inside __alloc_pages_may_oom() (precisely speaking,
if TIF_MEMDIE is set when current thread is after

        if (gfp_pfmemalloc_allowed(gfp_mask))
                alloc_flags = ALLOC_NO_WATERMARKS;

line and before

        /* Avoid allocations with no watermarks from looping endlessly */
        if (test_thread_flag(TIF_MEMDIE))
                goto nopage;

line, which is likely always true); but this is off-topic for this thread.

The lines which are executed only when __GFP_RETRY_MAYFAIL is set rather than
__GFP_NORETRY is set are

        /* Do not loop if specifically requested */
        if (gfp_mask & __GFP_NORETRY)
                goto nopage;

        /*
         * Do not retry costly high order allocations unless they are
         * __GFP_RETRY_MAYFAIL
         */
        if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_RETRY_MAYFAIL))
                goto nopage;

        if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
                                 did_some_progress > 0, &no_progress_loops))
                goto retry;

        /*
         * It doesn't make any sense to retry for the compaction if the order-0
         * reclaim is not able to make any progress because the current
         * implementation of the compaction depends on the sufficient amount
         * of free memory (see __compaction_suitable)
         */
        if (did_some_progress > 0 &&
                        should_compact_retry(ac, order, alloc_flags,
                                compact_result, &compact_priority,
                                &compaction_retries))
                goto retry;

        /*
         * It's possible we raced with cpuset update so the OOM would be
         * premature (see below the nopage: label for full explanation).
         */
        if (read_mems_allowed_retry(cpuset_mems_cookie))
                goto retry_cpuset;

        /* Reclaim has failed us, start killing things */
        page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
        if (page)
                goto got_pg;

__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
        const struct alloc_context *ac, unsigned long *did_some_progress)
{
        struct oom_control oc = {
                .zonelist = ac->zonelist,
                .nodemask = ac->nodemask,
                .memcg = NULL,
                .gfp_mask = gfp_mask,
                .order = order,
        };
        struct page *page;

        *did_some_progress = 0;

        /*
         * Acquire the oom lock.  If that fails, somebody else is
         * making progress for us.
         */
        if (!mutex_trylock(&oom_lock)) {
                *did_some_progress = 1;
                schedule_timeout_uninterruptible(1);
                return NULL;
        }

        /*
         * Go through the zonelist yet one more time, keep very high watermark
         * here, this is only to catch a parallel oom killing, we must fail if
         * we're still under heavy pressure.
         */
        page = get_page_from_freelist(gfp_mask | __GFP_HARDWALL, order,
                                        ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);
        if (page)
                goto out;

        /* Coredumps can quickly deplete all memory reserves */
        if (current->flags & PF_DUMPCORE)
                goto out;
        /* The OOM killer will not help higher order allocs */
        if (order > PAGE_ALLOC_COSTLY_ORDER)
                goto out;
        /*
         * We have already exhausted all our reclaim opportunities without any
         * success so it is time to admit defeat. We will skip the OOM killer
         * because it is very likely that the caller has a more reasonable
         * fallback than shooting a random task.
         */
        if (gfp_mask & __GFP_RETRY_MAYFAIL)
                goto out;

where both __GFP_NORETRY and __GFP_RETRY_MAYFAIL are checked after
direct reclaim and compaction failed. __GFP_RETRY_MAYFAIL optimistically
retries based on one of should_reclaim_retry() or should_compact_retry()
or read_mems_allowed_retry() returns true or mutex_trylock(&oom_lock) in
__alloc_pages_may_oom() returns 0. If !__GFP_FS allocation requests are
holding oom_lock each other, __GFP_RETRY_MAYFAIL allocation requests (which
are likely !__GFP_FS allocation requests due to __GFP_FS allocation requests
being blocked on direct reclaim) can be blocked for uncontrollable duration
without making progress. It seems to me that the difference between
__GFP_NORETRY and __GFP_RETRY_MAYFAIL is not useful. Rather, the caller can
set __GFP_NORETRY and retry with any control (e.g. set __GFP_HIGH upon first
timeout, give up upon second timeout).

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 3/4] xfs: map KM_MAYFAIL to __GFP_RETRY_MAYFAIL
  2017-03-08 11:23     ` Tetsuo Handa
@ 2017-03-08 12:54       ` Michal Hocko
  -1 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2017-03-08 12:54 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: linux-mm, Vlastimil Babka, Johannes Weiner, Mel Gorman,
	Andrew Morton, LKML, Darrick J. Wong

On Wed 08-03-17 20:23:37, Tetsuo Handa wrote:
> On 2017/03/08 0:48, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > KM_MAYFAIL didn't have any suitable GFP_FOO counterpart until recently
> > so it relied on the default page allocator behavior for the given set
> > of flags. This means that small allocations actually never failed.
> > 
> > Now that we have __GFP_RETRY_MAYFAIL flag which works independently on the
> > allocation request size we can map KM_MAYFAIL to it. The allocator will
> > try as hard as it can to fulfill the request but fails eventually if
> > the progress cannot be made.
> > 
> > Cc: Darrick J. Wong <darrick.wong@oracle.com>
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > ---
> >  fs/xfs/kmem.h | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> > 
> > diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
> > index ae08cfd9552a..ac80a4855c83 100644
> > --- a/fs/xfs/kmem.h
> > +++ b/fs/xfs/kmem.h
> > @@ -54,6 +54,16 @@ kmem_flags_convert(xfs_km_flags_t flags)
> >  			lflags &= ~__GFP_FS;
> >  	}
> >  
> > +	/*
> > +	 * Default page/slab allocator behavior is to retry for ever
> > +	 * for small allocations. We can override this behavior by using
> > +	 * __GFP_RETRY_MAYFAIL which will tell the allocator to retry as long
> > +	 * as it is feasible but rather fail than retry for ever for all
> > +	 * request sizes.
> > +	 */
> > +	if (flags & KM_MAYFAIL)
> > +		lflags |= __GFP_RETRY_MAYFAIL;
> 
> I don't see advantages of supporting both __GFP_NORETRY and __GFP_RETRY_MAYFAIL.
> kmem_flags_convert() can always set __GFP_NORETRY because the callers use
> opencoded __GFP_NOFAIL loop (with possible allocation lockup warning) unless
> KM_MAYFAIL is set.

The behavior would be different (e.g. the OOM killer handling).

[...]
> line, which is likely always true); but this is off-topic for this thread.

yes

[...]

> where both __GFP_NORETRY and __GFP_RETRY_MAYFAIL are checked after
> direct reclaim and compaction failed. __GFP_RETRY_MAYFAIL optimistically
> retries based on one of should_reclaim_retry() or should_compact_retry()
> or read_mems_allowed_retry() returns true or mutex_trylock(&oom_lock) in
> __alloc_pages_may_oom() returns 0. If !__GFP_FS allocation requests are
> holding oom_lock each other, __GFP_RETRY_MAYFAIL allocation requests (which
> are likely !__GFP_FS allocation requests due to __GFP_FS allocation requests
> being blocked on direct reclaim) can be blocked for uncontrollable duration
> without making progress. It seems to me that the difference between
> __GFP_NORETRY and __GFP_RETRY_MAYFAIL is not useful. Rather, the caller can
> set __GFP_NORETRY and retry with any control (e.g. set __GFP_HIGH upon first
> timeout, give up upon second timeout).

You are drown in implementation details here. Try to step back and think
about the high level semantic I would like to achieve - which is
essentially a middle ground between __GFP_NORETRY which doesn't retry
and __GFP_NOFAIL to retry for ever. There are users who could benefit
from such a semantic I believe (the most prominent example is kvmalloc
which has different modes of how hard to try kmalloc before giving up
and falling back to vmalloc)..

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 3/4] xfs: map KM_MAYFAIL to __GFP_RETRY_MAYFAIL
@ 2017-03-08 12:54       ` Michal Hocko
  0 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2017-03-08 12:54 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: linux-mm, Vlastimil Babka, Johannes Weiner, Mel Gorman,
	Andrew Morton, LKML, Darrick J. Wong

On Wed 08-03-17 20:23:37, Tetsuo Handa wrote:
> On 2017/03/08 0:48, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > KM_MAYFAIL didn't have any suitable GFP_FOO counterpart until recently
> > so it relied on the default page allocator behavior for the given set
> > of flags. This means that small allocations actually never failed.
> > 
> > Now that we have __GFP_RETRY_MAYFAIL flag which works independently on the
> > allocation request size we can map KM_MAYFAIL to it. The allocator will
> > try as hard as it can to fulfill the request but fails eventually if
> > the progress cannot be made.
> > 
> > Cc: Darrick J. Wong <darrick.wong@oracle.com>
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > ---
> >  fs/xfs/kmem.h | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> > 
> > diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
> > index ae08cfd9552a..ac80a4855c83 100644
> > --- a/fs/xfs/kmem.h
> > +++ b/fs/xfs/kmem.h
> > @@ -54,6 +54,16 @@ kmem_flags_convert(xfs_km_flags_t flags)
> >  			lflags &= ~__GFP_FS;
> >  	}
> >  
> > +	/*
> > +	 * Default page/slab allocator behavior is to retry for ever
> > +	 * for small allocations. We can override this behavior by using
> > +	 * __GFP_RETRY_MAYFAIL which will tell the allocator to retry as long
> > +	 * as it is feasible but rather fail than retry for ever for all
> > +	 * request sizes.
> > +	 */
> > +	if (flags & KM_MAYFAIL)
> > +		lflags |= __GFP_RETRY_MAYFAIL;
> 
> I don't see advantages of supporting both __GFP_NORETRY and __GFP_RETRY_MAYFAIL.
> kmem_flags_convert() can always set __GFP_NORETRY because the callers use
> opencoded __GFP_NOFAIL loop (with possible allocation lockup warning) unless
> KM_MAYFAIL is set.

The behavior would be different (e.g. the OOM killer handling).

[...]
> line, which is likely always true); but this is off-topic for this thread.

yes

[...]

> where both __GFP_NORETRY and __GFP_RETRY_MAYFAIL are checked after
> direct reclaim and compaction failed. __GFP_RETRY_MAYFAIL optimistically
> retries based on one of should_reclaim_retry() or should_compact_retry()
> or read_mems_allowed_retry() returns true or mutex_trylock(&oom_lock) in
> __alloc_pages_may_oom() returns 0. If !__GFP_FS allocation requests are
> holding oom_lock each other, __GFP_RETRY_MAYFAIL allocation requests (which
> are likely !__GFP_FS allocation requests due to __GFP_FS allocation requests
> being blocked on direct reclaim) can be blocked for uncontrollable duration
> without making progress. It seems to me that the difference between
> __GFP_NORETRY and __GFP_RETRY_MAYFAIL is not useful. Rather, the caller can
> set __GFP_NORETRY and retry with any control (e.g. set __GFP_HIGH upon first
> timeout, give up upon second timeout).

You are drown in implementation details here. Try to step back and think
about the high level semantic I would like to achieve - which is
essentially a middle ground between __GFP_NORETRY which doesn't retry
and __GFP_NOFAIL to retry for ever. There are users who could benefit
from such a semantic I believe (the most prominent example is kvmalloc
which has different modes of how hard to try kmalloc before giving up
and falling back to vmalloc)..

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/4] s390: get rid of superfluous __GFP_REPEAT
  2017-03-08  8:23     ` Heiko Carstens
@ 2017-03-08 14:11       ` Michal Hocko
  -1 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2017-03-08 14:11 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: linux-mm, Vlastimil Babka, Johannes Weiner, Mel Gorman,
	Andrew Morton, LKML

On Wed 08-03-17 09:23:40, Heiko Carstens wrote:
> On Tue, Mar 07, 2017 at 04:48:40PM +0100, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > __GFP_REPEAT has a rather weak semantic but since it has been introduced
> > around 2.6.12 it has been ignored for low order allocations.
> > 
> > page_table_alloc then uses the flag for a single page allocation. This
> > means that this flag has never been actually useful here because it has
> > always been used only for PAGE_ALLOC_COSTLY requests.
> > 
> > An earlier attempt to remove the flag 10d58bf297e2 ("s390: get rid of
> > superfluous __GFP_REPEAT") has missed this one but the situation is very
> > same here.
> > 
> > Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > ---
> >  arch/s390/mm/pgalloc.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> FWIW:
> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>

Thanks

> If you want, this can be routed via the s390 tree, whatever you prefer.

Yes, that would be great. I suspect the rest will take longer to get
merged or land to a conclusion.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/4] s390: get rid of superfluous __GFP_REPEAT
@ 2017-03-08 14:11       ` Michal Hocko
  0 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2017-03-08 14:11 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: linux-mm, Vlastimil Babka, Johannes Weiner, Mel Gorman,
	Andrew Morton, LKML

On Wed 08-03-17 09:23:40, Heiko Carstens wrote:
> On Tue, Mar 07, 2017 at 04:48:40PM +0100, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > __GFP_REPEAT has a rather weak semantic but since it has been introduced
> > around 2.6.12 it has been ignored for low order allocations.
> > 
> > page_table_alloc then uses the flag for a single page allocation. This
> > means that this flag has never been actually useful here because it has
> > always been used only for PAGE_ALLOC_COSTLY requests.
> > 
> > An earlier attempt to remove the flag 10d58bf297e2 ("s390: get rid of
> > superfluous __GFP_REPEAT") has missed this one but the situation is very
> > same here.
> > 
> > Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > ---
> >  arch/s390/mm/pgalloc.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> FWIW:
> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>

Thanks

> If you want, this can be routed via the s390 tree, whatever you prefer.

Yes, that would be great. I suspect the rest will take longer to get
merged or land to a conclusion.

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 3/4] xfs: map KM_MAYFAIL to __GFP_RETRY_MAYFAIL
  2017-03-07 15:48   ` Michal Hocko
@ 2017-03-08 15:06     ` Christoph Hellwig
  -1 siblings, 0 replies; 47+ messages in thread
From: Christoph Hellwig @ 2017-03-08 15:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Vlastimil Babka, Johannes Weiner, Mel Gorman,
	Andrew Morton, LKML, Michal Hocko, Darrick J. Wong

On Tue, Mar 07, 2017 at 04:48:42PM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> KM_MAYFAIL didn't have any suitable GFP_FOO counterpart until recently
> so it relied on the default page allocator behavior for the given set
> of flags. This means that small allocations actually never failed.
> 
> Now that we have __GFP_RETRY_MAYFAIL flag which works independently on the
> allocation request size we can map KM_MAYFAIL to it. The allocator will
> try as hard as it can to fulfill the request but fails eventually if
> the progress cannot be made.

I don't think we really need this - KM_MAYFAIL is basically just
a flag to not require the retry loop around kmalloc for those places
in XFS that can deal with allocation failures.  But if the default
behavior is to not fail we'll happily take that.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 3/4] xfs: map KM_MAYFAIL to __GFP_RETRY_MAYFAIL
@ 2017-03-08 15:06     ` Christoph Hellwig
  0 siblings, 0 replies; 47+ messages in thread
From: Christoph Hellwig @ 2017-03-08 15:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Vlastimil Babka, Johannes Weiner, Mel Gorman,
	Andrew Morton, LKML, Michal Hocko, Darrick J. Wong

On Tue, Mar 07, 2017 at 04:48:42PM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> KM_MAYFAIL didn't have any suitable GFP_FOO counterpart until recently
> so it relied on the default page allocator behavior for the given set
> of flags. This means that small allocations actually never failed.
> 
> Now that we have __GFP_RETRY_MAYFAIL flag which works independently on the
> allocation request size we can map KM_MAYFAIL to it. The allocator will
> try as hard as it can to fulfill the request but fails eventually if
> the progress cannot be made.

I don't think we really need this - KM_MAYFAIL is basically just
a flag to not require the retry loop around kmalloc for those places
in XFS that can deal with allocation failures.  But if the default
behavior is to not fail we'll happily take that.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/4] s390: get rid of superfluous __GFP_REPEAT
  2017-03-08 14:11       ` Michal Hocko
@ 2017-03-09  8:27         ` Heiko Carstens
  -1 siblings, 0 replies; 47+ messages in thread
From: Heiko Carstens @ 2017-03-09  8:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Vlastimil Babka, Johannes Weiner, Mel Gorman,
	Andrew Morton, LKML

On Wed, Mar 08, 2017 at 03:11:10PM +0100, Michal Hocko wrote:
> On Wed 08-03-17 09:23:40, Heiko Carstens wrote:
> > On Tue, Mar 07, 2017 at 04:48:40PM +0100, Michal Hocko wrote:
> > > From: Michal Hocko <mhocko@suse.com>
> > > 
> > > __GFP_REPEAT has a rather weak semantic but since it has been introduced
> > > around 2.6.12 it has been ignored for low order allocations.
> > > 
> > > page_table_alloc then uses the flag for a single page allocation. This
> > > means that this flag has never been actually useful here because it has
> > > always been used only for PAGE_ALLOC_COSTLY requests.
> > > 
> > > An earlier attempt to remove the flag 10d58bf297e2 ("s390: get rid of
> > > superfluous __GFP_REPEAT") has missed this one but the situation is very
> > > same here.
> > > 
> > > Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
> > > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > > ---
> > >  arch/s390/mm/pgalloc.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > FWIW:
> > Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
> 
> Thanks
> 
> > If you want, this can be routed via the s390 tree, whatever you prefer.
> 
> Yes, that would be great. I suspect the rest will take longer to get
> merged or land to a conclusion.

Ok, applied. Thanks! :)

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/4] s390: get rid of superfluous __GFP_REPEAT
@ 2017-03-09  8:27         ` Heiko Carstens
  0 siblings, 0 replies; 47+ messages in thread
From: Heiko Carstens @ 2017-03-09  8:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Vlastimil Babka, Johannes Weiner, Mel Gorman,
	Andrew Morton, LKML

On Wed, Mar 08, 2017 at 03:11:10PM +0100, Michal Hocko wrote:
> On Wed 08-03-17 09:23:40, Heiko Carstens wrote:
> > On Tue, Mar 07, 2017 at 04:48:40PM +0100, Michal Hocko wrote:
> > > From: Michal Hocko <mhocko@suse.com>
> > > 
> > > __GFP_REPEAT has a rather weak semantic but since it has been introduced
> > > around 2.6.12 it has been ignored for low order allocations.
> > > 
> > > page_table_alloc then uses the flag for a single page allocation. This
> > > means that this flag has never been actually useful here because it has
> > > always been used only for PAGE_ALLOC_COSTLY requests.
> > > 
> > > An earlier attempt to remove the flag 10d58bf297e2 ("s390: get rid of
> > > superfluous __GFP_REPEAT") has missed this one but the situation is very
> > > same here.
> > > 
> > > Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
> > > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > > ---
> > >  arch/s390/mm/pgalloc.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > FWIW:
> > Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
> 
> Thanks
> 
> > If you want, this can be routed via the s390 tree, whatever you prefer.
> 
> Yes, that would be great. I suspect the rest will take longer to get
> merged or land to a conclusion.

Ok, applied. Thanks! :)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 3/4] xfs: map KM_MAYFAIL to __GFP_RETRY_MAYFAIL
  2017-03-08 15:06     ` Christoph Hellwig
@ 2017-03-09  9:16       ` Michal Hocko
  -1 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2017-03-09  9:16 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-mm, Vlastimil Babka, Johannes Weiner, Mel Gorman,
	Andrew Morton, LKML, Darrick J. Wong

On Wed 08-03-17 07:06:59, Christoph Hellwig wrote:
> On Tue, Mar 07, 2017 at 04:48:42PM +0100, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > KM_MAYFAIL didn't have any suitable GFP_FOO counterpart until recently
> > so it relied on the default page allocator behavior for the given set
> > of flags. This means that small allocations actually never failed.
> > 
> > Now that we have __GFP_RETRY_MAYFAIL flag which works independently on the
> > allocation request size we can map KM_MAYFAIL to it. The allocator will
> > try as hard as it can to fulfill the request but fails eventually if
> > the progress cannot be made.
> 
> I don't think we really need this - KM_MAYFAIL is basically just
> a flag to not require the retry loop around kmalloc for those places
> in XFS that can deal with allocation failures.  But if the default
> behavior is to not fail we'll happily take that.

Does that mean that you are happy to go OOM and trigger the OOM killer
even when you know that the failure can be handled gracefully?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 3/4] xfs: map KM_MAYFAIL to __GFP_RETRY_MAYFAIL
@ 2017-03-09  9:16       ` Michal Hocko
  0 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2017-03-09  9:16 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-mm, Vlastimil Babka, Johannes Weiner, Mel Gorman,
	Andrew Morton, LKML, Darrick J. Wong

On Wed 08-03-17 07:06:59, Christoph Hellwig wrote:
> On Tue, Mar 07, 2017 at 04:48:42PM +0100, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > KM_MAYFAIL didn't have any suitable GFP_FOO counterpart until recently
> > so it relied on the default page allocator behavior for the given set
> > of flags. This means that small allocations actually never failed.
> > 
> > Now that we have __GFP_RETRY_MAYFAIL flag which works independently on the
> > allocation request size we can map KM_MAYFAIL to it. The allocator will
> > try as hard as it can to fulfill the request but fails eventually if
> > the progress cannot be made.
> 
> I don't think we really need this - KM_MAYFAIL is basically just
> a flag to not require the retry loop around kmalloc for those places
> in XFS that can deal with allocation failures.  But if the default
> behavior is to not fail we'll happily take that.

Does that mean that you are happy to go OOM and trigger the OOM killer
even when you know that the failure can be handled gracefully?

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/4 v2] mm: give __GFP_REPEAT a better semantic
  2017-03-07 15:48 ` Michal Hocko
@ 2017-05-16  9:10   ` Michal Hocko
  -1 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2017-05-16  9:10 UTC (permalink / raw)
  To: linux-mm
  Cc: Vlastimil Babka, Johannes Weiner, Mel Gorman, Andrew Morton,
	LKML, Darrick J. Wong, Heiko Carstens

So, is there some interest in this? I am not going to push this if there
is a general consensus that we do not need to do anything about the
current situation or need a different approach.

On Tue 07-03-17 16:48:39, Michal Hocko wrote:
> Hi,
> this is a follow up for __GFP_REPEAT clean up merged in 4.7. The previous
> version of this patch series was posted as an RFC
> http://lkml.kernel.org/r/1465212736-14637-1-git-send-email-mhocko@kernel.org
> Since then I have reconsidered the semantic and made it a counterpart
> to the __GFP_NORETRY and made it the other extreme end of the retry
> logic. Both are not invoking the OOM killer so they are suitable
> for allocation paths with a fallback. Also a new potential user has
> emerged (kvmalloc - see patch 4). I have also renamed the flag from
> __GFP_RETRY_HARD to __GFP_RETRY_MAY_FAIL as this should be more clear.
> 
> I have kept the RFC status because of the semantic change. The patch 1
> is an exception because it should be merge regardless of the rest.
> 
> The main motivation for the change is that the current implementation of
> __GFP_REPEAT is not very much useful.
> 
> The documentation says:
>  * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
>  *   _might_ fail.  This depends upon the particular VM implementation.
> 
> It just fails to mention that this is true only for large (costly) high
> order which has been the case since the flag was introduced. A similar
> semantic would be really helpful for smal orders as well, though,
> because we have places where a failure with a specific fallback error
> handling is preferred to a potential endless loop inside the page
> allocator.
> 
> The earlier cleanup dropped __GFP_REPEAT usage for low (!costly) order
> users so only those which might use larger orders have stayed. One user
> which slipped through cracks is addressed in patch 1.
> 
> Let's rename the flag to something more verbose and use it for existing
> users. Semantic for those will not change. Then implement low (!costly)
> orders failure path which is hit after the page allocator is about to
> invoke the oom killer. Now we have a good counterpart for __GFP_NORETRY
> and finally can tell try as hard as possible without the OOM killer.
> 
> Xfs code already has an existing annotation for allocations which are
> allowed to fail and we can trivially map them to the new gfp flag
> because it will provide the semantic KM_MAYFAIL wants.
> 
> kvmalloc will allow also !costly high order allocations to retry hard
> before falling back to the vmalloc.
> 
> The patchset is based on the current linux-next.
> 
> Shortlog
> Michal Hocko (4):
>       s390: get rid of superfluous __GFP_REPEAT
>       mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic
>       xfs: map KM_MAYFAIL to __GFP_RETRY_MAYFAIL
>       mm: kvmalloc support __GFP_RETRY_MAYFAIL for all sizes
> 
> Diffstat
>  Documentation/DMA-ISA-LPC.txt                |  2 +-
>  arch/powerpc/include/asm/book3s/64/pgalloc.h |  2 +-
>  arch/powerpc/kvm/book3s_64_mmu_hv.c          |  2 +-
>  arch/s390/mm/pgalloc.c                       |  2 +-
>  drivers/mmc/host/wbsd.c                      |  2 +-
>  drivers/s390/char/vmcp.c                     |  2 +-
>  drivers/target/target_core_transport.c       |  2 +-
>  drivers/vhost/net.c                          |  2 +-
>  drivers/vhost/scsi.c                         |  2 +-
>  drivers/vhost/vsock.c                        |  2 +-
>  fs/btrfs/check-integrity.c                   |  2 +-
>  fs/btrfs/raid56.c                            |  2 +-
>  fs/xfs/kmem.h                                | 10 +++++++++
>  include/linux/gfp.h                          | 32 +++++++++++++++++++---------
>  include/linux/slab.h                         |  3 ++-
>  include/trace/events/mmflags.h               |  2 +-
>  mm/hugetlb.c                                 |  4 ++--
>  mm/internal.h                                |  2 +-
>  mm/page_alloc.c                              | 14 +++++++++---
>  mm/sparse-vmemmap.c                          |  4 ++--
>  mm/util.c                                    | 14 ++++--------
>  mm/vmalloc.c                                 |  2 +-
>  mm/vmscan.c                                  |  8 +++----
>  net/core/dev.c                               |  6 +++---
>  net/core/skbuff.c                            |  2 +-
>  net/sched/sch_fq.c                           |  2 +-
>  tools/perf/builtin-kmem.c                    |  2 +-
>  27 files changed, 78 insertions(+), 53 deletions(-)
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/4 v2] mm: give __GFP_REPEAT a better semantic
@ 2017-05-16  9:10   ` Michal Hocko
  0 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2017-05-16  9:10 UTC (permalink / raw)
  To: linux-mm
  Cc: Vlastimil Babka, Johannes Weiner, Mel Gorman, Andrew Morton,
	LKML, Darrick J. Wong, Heiko Carstens

So, is there some interest in this? I am not going to push this if there
is a general consensus that we do not need to do anything about the
current situation or need a different approach.

On Tue 07-03-17 16:48:39, Michal Hocko wrote:
> Hi,
> this is a follow up for __GFP_REPEAT clean up merged in 4.7. The previous
> version of this patch series was posted as an RFC
> http://lkml.kernel.org/r/1465212736-14637-1-git-send-email-mhocko@kernel.org
> Since then I have reconsidered the semantic and made it a counterpart
> to the __GFP_NORETRY and made it the other extreme end of the retry
> logic. Both are not invoking the OOM killer so they are suitable
> for allocation paths with a fallback. Also a new potential user has
> emerged (kvmalloc - see patch 4). I have also renamed the flag from
> __GFP_RETRY_HARD to __GFP_RETRY_MAY_FAIL as this should be more clear.
> 
> I have kept the RFC status because of the semantic change. The patch 1
> is an exception because it should be merge regardless of the rest.
> 
> The main motivation for the change is that the current implementation of
> __GFP_REPEAT is not very much useful.
> 
> The documentation says:
>  * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
>  *   _might_ fail.  This depends upon the particular VM implementation.
> 
> It just fails to mention that this is true only for large (costly) high
> order which has been the case since the flag was introduced. A similar
> semantic would be really helpful for smal orders as well, though,
> because we have places where a failure with a specific fallback error
> handling is preferred to a potential endless loop inside the page
> allocator.
> 
> The earlier cleanup dropped __GFP_REPEAT usage for low (!costly) order
> users so only those which might use larger orders have stayed. One user
> which slipped through cracks is addressed in patch 1.
> 
> Let's rename the flag to something more verbose and use it for existing
> users. Semantic for those will not change. Then implement low (!costly)
> orders failure path which is hit after the page allocator is about to
> invoke the oom killer. Now we have a good counterpart for __GFP_NORETRY
> and finally can tell try as hard as possible without the OOM killer.
> 
> Xfs code already has an existing annotation for allocations which are
> allowed to fail and we can trivially map them to the new gfp flag
> because it will provide the semantic KM_MAYFAIL wants.
> 
> kvmalloc will allow also !costly high order allocations to retry hard
> before falling back to the vmalloc.
> 
> The patchset is based on the current linux-next.
> 
> Shortlog
> Michal Hocko (4):
>       s390: get rid of superfluous __GFP_REPEAT
>       mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic
>       xfs: map KM_MAYFAIL to __GFP_RETRY_MAYFAIL
>       mm: kvmalloc support __GFP_RETRY_MAYFAIL for all sizes
> 
> Diffstat
>  Documentation/DMA-ISA-LPC.txt                |  2 +-
>  arch/powerpc/include/asm/book3s/64/pgalloc.h |  2 +-
>  arch/powerpc/kvm/book3s_64_mmu_hv.c          |  2 +-
>  arch/s390/mm/pgalloc.c                       |  2 +-
>  drivers/mmc/host/wbsd.c                      |  2 +-
>  drivers/s390/char/vmcp.c                     |  2 +-
>  drivers/target/target_core_transport.c       |  2 +-
>  drivers/vhost/net.c                          |  2 +-
>  drivers/vhost/scsi.c                         |  2 +-
>  drivers/vhost/vsock.c                        |  2 +-
>  fs/btrfs/check-integrity.c                   |  2 +-
>  fs/btrfs/raid56.c                            |  2 +-
>  fs/xfs/kmem.h                                | 10 +++++++++
>  include/linux/gfp.h                          | 32 +++++++++++++++++++---------
>  include/linux/slab.h                         |  3 ++-
>  include/trace/events/mmflags.h               |  2 +-
>  mm/hugetlb.c                                 |  4 ++--
>  mm/internal.h                                |  2 +-
>  mm/page_alloc.c                              | 14 +++++++++---
>  mm/sparse-vmemmap.c                          |  4 ++--
>  mm/util.c                                    | 14 ++++--------
>  mm/vmalloc.c                                 |  2 +-
>  mm/vmscan.c                                  |  8 +++----
>  net/core/dev.c                               |  6 +++---
>  net/core/skbuff.c                            |  2 +-
>  net/sched/sch_fq.c                           |  2 +-
>  tools/perf/builtin-kmem.c                    |  2 +-
>  27 files changed, 78 insertions(+), 53 deletions(-)
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/4 v2] mm: give __GFP_REPEAT a better semantic
  2017-05-16  9:10   ` Michal Hocko
@ 2017-05-23  8:12     ` Vlastimil Babka
  -1 siblings, 0 replies; 47+ messages in thread
From: Vlastimil Babka @ 2017-05-23  8:12 UTC (permalink / raw)
  To: Michal Hocko, linux-mm
  Cc: Johannes Weiner, Mel Gorman, Andrew Morton, LKML,
	Darrick J. Wong, Heiko Carstens, NeilBrown, Jonathan Corbet,
	Paolo Bonzini, Eric W. Biederman

On 05/16/2017 11:10 AM, Michal Hocko wrote:
> So, is there some interest in this? I am not going to push this if there
> is a general consensus that we do not need to do anything about the
> current situation or need a different approach.

After the recent LWN article [1] I think that we should really support
marking allocations as failable, without making them too easily failable
via __GFP_NORETRY. The __GFP_RETRY_MAY_FAIL flag sounds like a good way
to do that without introducing a new __GFP_MAYFAIL. We could also
introduce a wrapper such as GFP_KERNEL_MAYFAIL.

[1] https://lwn.net/Articles/723317/

> On Tue 07-03-17 16:48:39, Michal Hocko wrote:
>> Hi,
>> this is a follow up for __GFP_REPEAT clean up merged in 4.7. The previous
>> version of this patch series was posted as an RFC
>> http://lkml.keprnel.org/r/1465212736-14637-1-git-send-email-mhocko@kernel.org
>> Since then I have reconsidered the semantic and made it a counterpart
>> to the __GFP_NORETRY and made it the other extreme end of the retry
>> logic. Both are not invoking the OOM killer so they are suitable
>> for allocation paths with a fallback. Also a new potential user has
>> emerged (kvmalloc - see patch 4). I have also renamed the flag from
>> __GFP_RETRY_HARD to __GFP_RETRY_MAY_FAIL as this should be more clear.
>>
>> I have kept the RFC status because of the semantic change. The patch 1
>> is an exception because it should be merge regardless of the rest.
>>
>> The main motivation for the change is that the current implementation of
>> __GFP_REPEAT is not very much useful.
>>
>> The documentation says:
>>  * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
>>  *   _might_ fail.  This depends upon the particular VM implementation.
>>
>> It just fails to mention that this is true only for large (costly) high
>> order which has been the case since the flag was introduced. A similar
>> semantic would be really helpful for smal orders as well, though,
>> because we have places where a failure with a specific fallback error
>> handling is preferred to a potential endless loop inside the page
>> allocator.
>>
>> The earlier cleanup dropped __GFP_REPEAT usage for low (!costly) order
>> users so only those which might use larger orders have stayed. One user
>> which slipped through cracks is addressed in patch 1.
>>
>> Let's rename the flag to something more verbose and use it for existing
>> users. Semantic for those will not change. Then implement low (!costly)
>> orders failure path which is hit after the page allocator is about to
>> invoke the oom killer. Now we have a good counterpart for __GFP_NORETRY
>> and finally can tell try as hard as possible without the OOM killer.
>>
>> Xfs code already has an existing annotation for allocations which are
>> allowed to fail and we can trivially map them to the new gfp flag
>> because it will provide the semantic KM_MAYFAIL wants.
>>
>> kvmalloc will allow also !costly high order allocations to retry hard
>> before falling back to the vmalloc.
>>
>> The patchset is based on the current linux-next.
>>
>> Shortlog
>> Michal Hocko (4):
>>       s390: get rid of superfluous __GFP_REPEAT
>>       mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic
>>       xfs: map KM_MAYFAIL to __GFP_RETRY_MAYFAIL
>>       mm: kvmalloc support __GFP_RETRY_MAYFAIL for all sizes
>>
>> Diffstat
>>  Documentation/DMA-ISA-LPC.txt                |  2 +-
>>  arch/powerpc/include/asm/book3s/64/pgalloc.h |  2 +-
>>  arch/powerpc/kvm/book3s_64_mmu_hv.c          |  2 +-
>>  arch/s390/mm/pgalloc.c                       |  2 +-
>>  drivers/mmc/host/wbsd.c                      |  2 +-
>>  drivers/s390/char/vmcp.c                     |  2 +-
>>  drivers/target/target_core_transport.c       |  2 +-
>>  drivers/vhost/net.c                          |  2 +-
>>  drivers/vhost/scsi.c                         |  2 +-
>>  drivers/vhost/vsock.c                        |  2 +-
>>  fs/btrfs/check-integrity.c                   |  2 +-
>>  fs/btrfs/raid56.c                            |  2 +-
>>  fs/xfs/kmem.h                                | 10 +++++++++
>>  include/linux/gfp.h                          | 32 +++++++++++++++++++---------
>>  include/linux/slab.h                         |  3 ++-
>>  include/trace/events/mmflags.h               |  2 +-
>>  mm/hugetlb.c                                 |  4 ++--
>>  mm/internal.h                                |  2 +-
>>  mm/page_alloc.c                              | 14 +++++++++---
>>  mm/sparse-vmemmap.c                          |  4 ++--
>>  mm/util.c                                    | 14 ++++--------
>>  mm/vmalloc.c                                 |  2 +-
>>  mm/vmscan.c                                  |  8 +++----
>>  net/core/dev.c                               |  6 +++---
>>  net/core/skbuff.c                            |  2 +-
>>  net/sched/sch_fq.c                           |  2 +-
>>  tools/perf/builtin-kmem.c                    |  2 +-
>>  27 files changed, 78 insertions(+), 53 deletions(-)
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to majordomo@kvack.org.  For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/4 v2] mm: give __GFP_REPEAT a better semantic
@ 2017-05-23  8:12     ` Vlastimil Babka
  0 siblings, 0 replies; 47+ messages in thread
From: Vlastimil Babka @ 2017-05-23  8:12 UTC (permalink / raw)
  To: Michal Hocko, linux-mm
  Cc: Johannes Weiner, Mel Gorman, Andrew Morton, LKML,
	Darrick J. Wong, Heiko Carstens, NeilBrown, Jonathan Corbet,
	Paolo Bonzini, Eric W. Biederman

On 05/16/2017 11:10 AM, Michal Hocko wrote:
> So, is there some interest in this? I am not going to push this if there
> is a general consensus that we do not need to do anything about the
> current situation or need a different approach.

After the recent LWN article [1] I think that we should really support
marking allocations as failable, without making them too easily failable
via __GFP_NORETRY. The __GFP_RETRY_MAY_FAIL flag sounds like a good way
to do that without introducing a new __GFP_MAYFAIL. We could also
introduce a wrapper such as GFP_KERNEL_MAYFAIL.

[1] https://lwn.net/Articles/723317/

> On Tue 07-03-17 16:48:39, Michal Hocko wrote:
>> Hi,
>> this is a follow up for __GFP_REPEAT clean up merged in 4.7. The previous
>> version of this patch series was posted as an RFC
>> http://lkml.keprnel.org/r/1465212736-14637-1-git-send-email-mhocko@kernel.org
>> Since then I have reconsidered the semantic and made it a counterpart
>> to the __GFP_NORETRY and made it the other extreme end of the retry
>> logic. Both are not invoking the OOM killer so they are suitable
>> for allocation paths with a fallback. Also a new potential user has
>> emerged (kvmalloc - see patch 4). I have also renamed the flag from
>> __GFP_RETRY_HARD to __GFP_RETRY_MAY_FAIL as this should be more clear.
>>
>> I have kept the RFC status because of the semantic change. The patch 1
>> is an exception because it should be merge regardless of the rest.
>>
>> The main motivation for the change is that the current implementation of
>> __GFP_REPEAT is not very much useful.
>>
>> The documentation says:
>>  * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
>>  *   _might_ fail.  This depends upon the particular VM implementation.
>>
>> It just fails to mention that this is true only for large (costly) high
>> order which has been the case since the flag was introduced. A similar
>> semantic would be really helpful for smal orders as well, though,
>> because we have places where a failure with a specific fallback error
>> handling is preferred to a potential endless loop inside the page
>> allocator.
>>
>> The earlier cleanup dropped __GFP_REPEAT usage for low (!costly) order
>> users so only those which might use larger orders have stayed. One user
>> which slipped through cracks is addressed in patch 1.
>>
>> Let's rename the flag to something more verbose and use it for existing
>> users. Semantic for those will not change. Then implement low (!costly)
>> orders failure path which is hit after the page allocator is about to
>> invoke the oom killer. Now we have a good counterpart for __GFP_NORETRY
>> and finally can tell try as hard as possible without the OOM killer.
>>
>> Xfs code already has an existing annotation for allocations which are
>> allowed to fail and we can trivially map them to the new gfp flag
>> because it will provide the semantic KM_MAYFAIL wants.
>>
>> kvmalloc will allow also !costly high order allocations to retry hard
>> before falling back to the vmalloc.
>>
>> The patchset is based on the current linux-next.
>>
>> Shortlog
>> Michal Hocko (4):
>>       s390: get rid of superfluous __GFP_REPEAT
>>       mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic
>>       xfs: map KM_MAYFAIL to __GFP_RETRY_MAYFAIL
>>       mm: kvmalloc support __GFP_RETRY_MAYFAIL for all sizes
>>
>> Diffstat
>>  Documentation/DMA-ISA-LPC.txt                |  2 +-
>>  arch/powerpc/include/asm/book3s/64/pgalloc.h |  2 +-
>>  arch/powerpc/kvm/book3s_64_mmu_hv.c          |  2 +-
>>  arch/s390/mm/pgalloc.c                       |  2 +-
>>  drivers/mmc/host/wbsd.c                      |  2 +-
>>  drivers/s390/char/vmcp.c                     |  2 +-
>>  drivers/target/target_core_transport.c       |  2 +-
>>  drivers/vhost/net.c                          |  2 +-
>>  drivers/vhost/scsi.c                         |  2 +-
>>  drivers/vhost/vsock.c                        |  2 +-
>>  fs/btrfs/check-integrity.c                   |  2 +-
>>  fs/btrfs/raid56.c                            |  2 +-
>>  fs/xfs/kmem.h                                | 10 +++++++++
>>  include/linux/gfp.h                          | 32 +++++++++++++++++++---------
>>  include/linux/slab.h                         |  3 ++-
>>  include/trace/events/mmflags.h               |  2 +-
>>  mm/hugetlb.c                                 |  4 ++--
>>  mm/internal.h                                |  2 +-
>>  mm/page_alloc.c                              | 14 +++++++++---
>>  mm/sparse-vmemmap.c                          |  4 ++--
>>  mm/util.c                                    | 14 ++++--------
>>  mm/vmalloc.c                                 |  2 +-
>>  mm/vmscan.c                                  |  8 +++----
>>  net/core/dev.c                               |  6 +++---
>>  net/core/skbuff.c                            |  2 +-
>>  net/sched/sch_fq.c                           |  2 +-
>>  tools/perf/builtin-kmem.c                    |  2 +-
>>  27 files changed, 78 insertions(+), 53 deletions(-)
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to majordomo@kvack.org.  For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/4 v2] mm: give __GFP_REPEAT a better semantic
  2017-05-23  8:12     ` Vlastimil Babka
  (?)
@ 2017-05-24  1:06     ` NeilBrown
  2017-05-24  7:34         ` Michal Hocko
  -1 siblings, 1 reply; 47+ messages in thread
From: NeilBrown @ 2017-05-24  1:06 UTC (permalink / raw)
  To: Vlastimil Babka, Michal Hocko, linux-mm
  Cc: Johannes Weiner, Mel Gorman, Andrew Morton, LKML,
	Darrick J. Wong, Heiko Carstens, NeilBrown, Jonathan Corbet,
	Paolo Bonzini, Eric W. Biederman

[-- Attachment #1: Type: text/plain, Size: 7088 bytes --]

On Tue, May 23 2017, Vlastimil Babka wrote:

> On 05/16/2017 11:10 AM, Michal Hocko wrote:
>> So, is there some interest in this? I am not going to push this if there
>> is a general consensus that we do not need to do anything about the
>> current situation or need a different approach.
>
> After the recent LWN article [1] I think that we should really support
> marking allocations as failable, without making them too easily failable
> via __GFP_NORETRY. The __GFP_RETRY_MAY_FAIL flag sounds like a good way
> to do that without introducing a new __GFP_MAYFAIL. We could also
> introduce a wrapper such as GFP_KERNEL_MAYFAIL.
>
> [1] https://lwn.net/Articles/723317/

Yes please!!!

I particularly like:

> - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior and
>   all allocation requests fail early rather than cause disruptive
>   reclaim (one round of reclaim in this implementation). The OOM killer
>   is not invoked.
> - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator behavior
>   and all allocation requests try really hard. The request will fail if the
>   reclaim cannot make any progress. The OOM killer won't be triggered.
> - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
>   and all allocation requests will loop endlessly until they
>   succeed. This might be really dangerous especially for larger orders.

There seems to be a good range here, and the two end points are good
choices.
I like that only __GFP_NOFAIL triggers the OOM.
I would like the middle option to be the default.  I think that is what
many people thought the default was.  I appreciate that making the
transition might be awkward.
Maybe create GFP_DEFAULT which matches the middle option and encourage
that in new code??

We would probably want guidelines on when __GFP_NOFAIL is acceptable.
I assume:
  - no locks held
  - small allocations OK, large allocation need clear justification.
  - error would be exposed to systemcall
???

I think it is important to give kernel developers clear options and make
it easy for them to choose the best option.  This helps to do that.

Thanks,
NeilBrown


>
>> On Tue 07-03-17 16:48:39, Michal Hocko wrote:
>>> Hi,
>>> this is a follow up for __GFP_REPEAT clean up merged in 4.7. The previous
>>> version of this patch series was posted as an RFC
>>> http://lkml.keprnel.org/r/1465212736-14637-1-git-send-email-mhocko@kernel.org
>>> Since then I have reconsidered the semantic and made it a counterpart
>>> to the __GFP_NORETRY and made it the other extreme end of the retry
>>> logic. Both are not invoking the OOM killer so they are suitable
>>> for allocation paths with a fallback. Also a new potential user has
>>> emerged (kvmalloc - see patch 4). I have also renamed the flag from
>>> __GFP_RETRY_HARD to __GFP_RETRY_MAY_FAIL as this should be more clear.
>>>
>>> I have kept the RFC status because of the semantic change. The patch 1
>>> is an exception because it should be merge regardless of the rest.
>>>
>>> The main motivation for the change is that the current implementation of
>>> __GFP_REPEAT is not very much useful.
>>>
>>> The documentation says:
>>>  * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
>>>  *   _might_ fail.  This depends upon the particular VM implementation.
>>>
>>> It just fails to mention that this is true only for large (costly) high
>>> order which has been the case since the flag was introduced. A similar
>>> semantic would be really helpful for smal orders as well, though,
>>> because we have places where a failure with a specific fallback error
>>> handling is preferred to a potential endless loop inside the page
>>> allocator.
>>>
>>> The earlier cleanup dropped __GFP_REPEAT usage for low (!costly) order
>>> users so only those which might use larger orders have stayed. One user
>>> which slipped through cracks is addressed in patch 1.
>>>
>>> Let's rename the flag to something more verbose and use it for existing
>>> users. Semantic for those will not change. Then implement low (!costly)
>>> orders failure path which is hit after the page allocator is about to
>>> invoke the oom killer. Now we have a good counterpart for __GFP_NORETRY
>>> and finally can tell try as hard as possible without the OOM killer.
>>>
>>> Xfs code already has an existing annotation for allocations which are
>>> allowed to fail and we can trivially map them to the new gfp flag
>>> because it will provide the semantic KM_MAYFAIL wants.
>>>
>>> kvmalloc will allow also !costly high order allocations to retry hard
>>> before falling back to the vmalloc.
>>>
>>> The patchset is based on the current linux-next.
>>>
>>> Shortlog
>>> Michal Hocko (4):
>>>       s390: get rid of superfluous __GFP_REPEAT
>>>       mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic
>>>       xfs: map KM_MAYFAIL to __GFP_RETRY_MAYFAIL
>>>       mm: kvmalloc support __GFP_RETRY_MAYFAIL for all sizes
>>>
>>> Diffstat
>>>  Documentation/DMA-ISA-LPC.txt                |  2 +-
>>>  arch/powerpc/include/asm/book3s/64/pgalloc.h |  2 +-
>>>  arch/powerpc/kvm/book3s_64_mmu_hv.c          |  2 +-
>>>  arch/s390/mm/pgalloc.c                       |  2 +-
>>>  drivers/mmc/host/wbsd.c                      |  2 +-
>>>  drivers/s390/char/vmcp.c                     |  2 +-
>>>  drivers/target/target_core_transport.c       |  2 +-
>>>  drivers/vhost/net.c                          |  2 +-
>>>  drivers/vhost/scsi.c                         |  2 +-
>>>  drivers/vhost/vsock.c                        |  2 +-
>>>  fs/btrfs/check-integrity.c                   |  2 +-
>>>  fs/btrfs/raid56.c                            |  2 +-
>>>  fs/xfs/kmem.h                                | 10 +++++++++
>>>  include/linux/gfp.h                          | 32 +++++++++++++++++++---------
>>>  include/linux/slab.h                         |  3 ++-
>>>  include/trace/events/mmflags.h               |  2 +-
>>>  mm/hugetlb.c                                 |  4 ++--
>>>  mm/internal.h                                |  2 +-
>>>  mm/page_alloc.c                              | 14 +++++++++---
>>>  mm/sparse-vmemmap.c                          |  4 ++--
>>>  mm/util.c                                    | 14 ++++--------
>>>  mm/vmalloc.c                                 |  2 +-
>>>  mm/vmscan.c                                  |  8 +++----
>>>  net/core/dev.c                               |  6 +++---
>>>  net/core/skbuff.c                            |  2 +-
>>>  net/sched/sch_fq.c                           |  2 +-
>>>  tools/perf/builtin-kmem.c                    |  2 +-
>>>  27 files changed, 78 insertions(+), 53 deletions(-)
>>>
>>> --
>>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>>> the body to majordomo@kvack.org.  For more info on Linux MM,
>>> see: http://www.linux-mm.org/ .
>>> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/4 v2] mm: give __GFP_REPEAT a better semantic
  2017-05-24  1:06     ` NeilBrown
@ 2017-05-24  7:34         ` Michal Hocko
  0 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2017-05-24  7:34 UTC (permalink / raw)
  To: NeilBrown
  Cc: Vlastimil Babka, linux-mm, Johannes Weiner, Mel Gorman,
	Andrew Morton, LKML, Darrick J. Wong, Heiko Carstens, NeilBrown,
	Jonathan Corbet, Paolo Bonzini, Eric W. Biederman

On Wed 24-05-17 11:06:04, NeilBrown wrote:
> On Tue, May 23 2017, Vlastimil Babka wrote:
> 
> > On 05/16/2017 11:10 AM, Michal Hocko wrote:
> >> So, is there some interest in this? I am not going to push this if there
> >> is a general consensus that we do not need to do anything about the
> >> current situation or need a different approach.
> >
> > After the recent LWN article [1] I think that we should really support
> > marking allocations as failable, without making them too easily failable
> > via __GFP_NORETRY. The __GFP_RETRY_MAY_FAIL flag sounds like a good way
> > to do that without introducing a new __GFP_MAYFAIL. We could also
> > introduce a wrapper such as GFP_KERNEL_MAYFAIL.
> >
> > [1] https://lwn.net/Articles/723317/
> 
> Yes please!!!
> 
> I particularly like:
> 
> > - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior and
> >   all allocation requests fail early rather than cause disruptive
> >   reclaim (one round of reclaim in this implementation). The OOM killer
> >   is not invoked.
> > - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator behavior
> >   and all allocation requests try really hard. The request will fail if the
> >   reclaim cannot make any progress. The OOM killer won't be triggered.
> > - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
> >   and all allocation requests will loop endlessly until they
> >   succeed. This might be really dangerous especially for larger orders.
> 
> There seems to be a good range here, and the two end points are good
> choices.
> I like that only __GFP_NOFAIL triggers the OOM.
> I would like the middle option to be the default.  I think that is what
> many people thought the default was.  I appreciate that making the
> transition might be awkward.

Yeah, turning GFP_KERNEL int GFP_KERNEL | __GFP_RETRY_MAYFAIL would be
hard if possible at all. One of the problems with the current code is
that error paths are checked but there is rarely a sane error handling
strategy implemented on top. So we mostly check for the failure and
return -ENOMEM up the call chain without having a great clue what will
happen up there. And the result might be really unexpected. Say that
some allocation fails on the sys_close() path and returns to the
userspace. a) this syscall is not supposed to return -ENOMEM b) there is
no _transaction_ rollback to have the fd in a sane state to retry later.

Therefore I assume that __GFP_RETRY_MAYFAIL will be slowly added to
those places where the error path strategy is clear.

> Maybe create GFP_DEFAULT which matches the middle option and encourage
> that in new code??
> 
> We would probably want guidelines on when __GFP_NOFAIL is acceptable.
> I assume:
>   - no locks held

This is of course preferable but hard to demand in general. I think that
requiring "no locks which can block oom victim exit" would be more
appropriate, albeit much more fuzzy. But in general locks should be much
smaller problem these days with the async OOM reclaim (oom_reaper) and
with __GFP_NOFAIL gaining access to a part of memory reserves when
hitting the OOM path.

>   - small allocations OK, large allocation need clear justification.

yes

>   - error would be exposed to systemcall

Not only. There are some FS transaction code paths where failure
basically means RO remount and such. This would be acceptable as well.
> ???
> 
> I think it is important to give kernel developers clear options and make
> it easy for them to choose the best option.  This helps to do that.

Yes, I completely agree here. Does the updated documentation in the
patch helps or would you suggest som improvements? 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/4 v2] mm: give __GFP_REPEAT a better semantic
@ 2017-05-24  7:34         ` Michal Hocko
  0 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2017-05-24  7:34 UTC (permalink / raw)
  To: NeilBrown
  Cc: Vlastimil Babka, linux-mm, Johannes Weiner, Mel Gorman,
	Andrew Morton, LKML, Darrick J. Wong, Heiko Carstens, NeilBrown,
	Jonathan Corbet, Paolo Bonzini, Eric W. Biederman

On Wed 24-05-17 11:06:04, NeilBrown wrote:
> On Tue, May 23 2017, Vlastimil Babka wrote:
> 
> > On 05/16/2017 11:10 AM, Michal Hocko wrote:
> >> So, is there some interest in this? I am not going to push this if there
> >> is a general consensus that we do not need to do anything about the
> >> current situation or need a different approach.
> >
> > After the recent LWN article [1] I think that we should really support
> > marking allocations as failable, without making them too easily failable
> > via __GFP_NORETRY. The __GFP_RETRY_MAY_FAIL flag sounds like a good way
> > to do that without introducing a new __GFP_MAYFAIL. We could also
> > introduce a wrapper such as GFP_KERNEL_MAYFAIL.
> >
> > [1] https://lwn.net/Articles/723317/
> 
> Yes please!!!
> 
> I particularly like:
> 
> > - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior and
> >   all allocation requests fail early rather than cause disruptive
> >   reclaim (one round of reclaim in this implementation). The OOM killer
> >   is not invoked.
> > - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator behavior
> >   and all allocation requests try really hard. The request will fail if the
> >   reclaim cannot make any progress. The OOM killer won't be triggered.
> > - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
> >   and all allocation requests will loop endlessly until they
> >   succeed. This might be really dangerous especially for larger orders.
> 
> There seems to be a good range here, and the two end points are good
> choices.
> I like that only __GFP_NOFAIL triggers the OOM.
> I would like the middle option to be the default.  I think that is what
> many people thought the default was.  I appreciate that making the
> transition might be awkward.

Yeah, turning GFP_KERNEL int GFP_KERNEL | __GFP_RETRY_MAYFAIL would be
hard if possible at all. One of the problems with the current code is
that error paths are checked but there is rarely a sane error handling
strategy implemented on top. So we mostly check for the failure and
return -ENOMEM up the call chain without having a great clue what will
happen up there. And the result might be really unexpected. Say that
some allocation fails on the sys_close() path and returns to the
userspace. a) this syscall is not supposed to return -ENOMEM b) there is
no _transaction_ rollback to have the fd in a sane state to retry later.

Therefore I assume that __GFP_RETRY_MAYFAIL will be slowly added to
those places where the error path strategy is clear.

> Maybe create GFP_DEFAULT which matches the middle option and encourage
> that in new code??
> 
> We would probably want guidelines on when __GFP_NOFAIL is acceptable.
> I assume:
>   - no locks held

This is of course preferable but hard to demand in general. I think that
requiring "no locks which can block oom victim exit" would be more
appropriate, albeit much more fuzzy. But in general locks should be much
smaller problem these days with the async OOM reclaim (oom_reaper) and
with __GFP_NOFAIL gaining access to a part of memory reserves when
hitting the OOM path.

>   - small allocations OK, large allocation need clear justification.

yes

>   - error would be exposed to systemcall

Not only. There are some FS transaction code paths where failure
basically means RO remount and such. This would be acceptable as well.
> ???
> 
> I think it is important to give kernel developers clear options and make
> it easy for them to choose the best option.  This helps to do that.

Yes, I completely agree here. Does the updated documentation in the
patch helps or would you suggest som improvements? 

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 2/4] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic
  2017-03-07 15:48   ` Michal Hocko
  (?)
@ 2017-05-25  1:21   ` NeilBrown
  2017-05-31 11:42       ` Michal Hocko
  -1 siblings, 1 reply; 47+ messages in thread
From: NeilBrown @ 2017-05-25  1:21 UTC (permalink / raw)
  To: Michal Hocko, linux-mm
  Cc: Vlastimil Babka, Johannes Weiner, Mel Gorman, Andrew Morton,
	LKML, Michal Hocko

[-- Attachment #1: Type: text/plain, Size: 6692 bytes --]

On Tue, Mar 07 2017, Michal Hocko wrote:

> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 2bfcfd33e476..60af7937c6f2 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -25,7 +25,7 @@ struct vm_area_struct;
>  #define ___GFP_FS		0x80u
>  #define ___GFP_COLD		0x100u
>  #define ___GFP_NOWARN		0x200u
> -#define ___GFP_REPEAT		0x400u
> +#define ___GFP_RETRY_MAYFAIL		0x400u
>  #define ___GFP_NOFAIL		0x800u
>  #define ___GFP_NORETRY		0x1000u
>  #define ___GFP_MEMALLOC		0x2000u
> @@ -136,26 +136,38 @@ struct vm_area_struct;
>   *
>   * __GFP_RECLAIM is shorthand to allow/forbid both direct and kswapd reclaim.
>   *
> - * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
> - *   _might_ fail.  This depends upon the particular VM implementation.
> + * The default allocator behavior depends on the request size. We have a concept
> + * of so called costly allocations (with order > PAGE_ALLOC_COSTLY_ORDER).

Boundary conditions is one of my pet peeves....
The description here suggests that an allocation of
"1<<PAGE_ALLOC_COSTLY_ORDER" pages is not "costly", which is
inconsistent with how those words would normally be interpreted.

Looking at the code I see comparisons like:

   order < PAGE_ALLOC_COSTLY_ORDER
or
   order >= PAGE_ALLOC_COSTLY_ORDER

which supports the documented (but incoherent) meaning.

But I also see:

  order = max_t(int, PAGE_ALLOC_COSTLY_ORDER - 1, 0);

which looks like it is trying to perform the largest non-costly
allocation, but is making a smaller allocation than necessary.

I would *really* like it if the constant actually meant what its name
implied.

 PAGE_ALLOC_MAX_NON_COSTLY
??

> + * !costly allocations are too essential to fail so they are implicitly
> + * non-failing (with some exceptions like OOM victims might fail) by default while
> + * costly requests try to be not disruptive and back off even without invoking
> + * the OOM killer. The following three modifiers might be used to override some of
> + * these implicit rules
> + *
> + * __GFP_NORETRY: The VM implementation must not retry indefinitely and will
> + *   return NULL when direct reclaim and memory compaction have failed to allow
> + *   the allocation to succeed.  The OOM killer is not called with the current
> + *   implementation. This is a default mode for costly allocations.

The name here is "NORETRY", but the text says "not retry indefinitely".
So does it retry or not?
I would assuming it "tried" once, and only once.
However it could be that a "try" is not a simple well defined task.
Maybe some escalation happens on the 2nd or 3rd "try", so they are really
trying different things?

The word "indefinitely" implies there is a definite limit.  It might
help to say what that is, or at least say that it is small.

Also, this documentation is phrased to tell the VM implementor what is,
or is not, allowed.  Most readers will be more interested is the
responsibilities of the caller.

  __GFP_NORETRY: The VM implementation will not retry after all
     reasonable avenues for finding free memory have been pursued.  The
     implementation may sleep (i.e. call 'schedule()'), but only while
     waiting for another task to perform some specific action.
     The caller must handle failure.  This flag is suitable when failure can
     easily be handled at small cost, such as reduced throughput.
  

> + *
> + * __GFP_RETRY_MAYFAIL: Try hard to allocate the memory, but the allocation attempt
> + *   _might_ fail. All viable forms of memory reclaim are tried before the fail.
> + *   The OOM killer is excluded because this would be too disruptive. This can be
> + *   used to override non-failing default behavior for !costly requests as well as
> + *   fortify costly requests.

What does "Try hard" mean?
In part, it means "retry everything a few more times", I guess in the
hope that something happened in the mean time.
It also seems to mean waiting for compaction to happen, which I
guess is only relevant for >PAGE_SIZE allocations?
Maybe it also means waiting for page-out to complete.
So the summary would be that it waits for a little while, hoping for a
miracle.

   __GFP_RETRY_MAYFAIL:  The VM implementation will retry memory reclaim
     procedures that have previously failed if there is some indication
     that progress has been made else where.  It can wait for other
     tasks to attempt high level approaches to freeing memory such as
     compaction (which removed fragmentation) and page-out.
     There is still a definite limit to the number of retries, but it is
     a larger limit than with __GFP_NORERY.
     Allocations with this flag may fail, but only when there is
     genuinely little unused memory.  While these allocations do not
     directly trigger the OOM killer, their failure indicates that the
     system is likely to need to use the OOM killer soon.
     The caller must handle failure, but can reasonably do so by failing
     a higher-level request, or completing it only in a much less
     efficient manner.
     If the allocation does fail, and the caller is in a position to
     free some non-essential memory, doing so could benefit the system
     as a whole.
    


>   *
>   * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
>   *   cannot handle allocation failures. New users should be evaluated carefully
>   *   (and the flag should be used only when there is no reasonable failure
>   *   policy) but it is definitely preferable to use the flag rather than
> - *   opencode endless loop around allocator.
> - *
> - * __GFP_NORETRY: The VM implementation must not retry indefinitely and will
> - *   return NULL when direct reclaim and memory compaction have failed to allow
> - *   the allocation to succeed.  The OOM killer is not called with the current
> - *   implementation.
> + *   opencode endless loop around allocator. Using this flag for costly allocations
> + *   is _highly_ discouraged.

Should this explicitly say that the OOM killer might be invoked in an attempt
to satisfy this allocation?  Is the OOM killer *only* invoked from
allocations with __GFP_NOFAIL ?
Maybe be extra explicit "The allocation could block indefinitely but
will never return with failure.  Testing for failure is pointless.".


I've probably got several specifics wrong.  I've tried to answer the
questions that I would like to see answered by the documentation.   If
you can fix it up so that those questions are answered correctly, that
would be great.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 2/4] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic
  2017-05-25  1:21   ` NeilBrown
@ 2017-05-31 11:42       ` Michal Hocko
  0 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2017-05-31 11:42 UTC (permalink / raw)
  To: NeilBrown
  Cc: linux-mm, Vlastimil Babka, Johannes Weiner, Mel Gorman,
	Andrew Morton, LKML

[I am sorry but I didn't get to this earlier]

On Thu 25-05-17 11:21:05, NeilBrown wrote:
> On Tue, Mar 07 2017, Michal Hocko wrote:
> 
> > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > index 2bfcfd33e476..60af7937c6f2 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -25,7 +25,7 @@ struct vm_area_struct;
> >  #define ___GFP_FS		0x80u
> >  #define ___GFP_COLD		0x100u
> >  #define ___GFP_NOWARN		0x200u
> > -#define ___GFP_REPEAT		0x400u
> > +#define ___GFP_RETRY_MAYFAIL		0x400u
> >  #define ___GFP_NOFAIL		0x800u
> >  #define ___GFP_NORETRY		0x1000u
> >  #define ___GFP_MEMALLOC		0x2000u
> > @@ -136,26 +136,38 @@ struct vm_area_struct;
> >   *
> >   * __GFP_RECLAIM is shorthand to allow/forbid both direct and kswapd reclaim.
> >   *
> > - * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
> > - *   _might_ fail.  This depends upon the particular VM implementation.
> > + * The default allocator behavior depends on the request size. We have a concept
> > + * of so called costly allocations (with order > PAGE_ALLOC_COSTLY_ORDER).
> 
> Boundary conditions is one of my pet peeves....
> The description here suggests that an allocation of
> "1<<PAGE_ALLOC_COSTLY_ORDER" pages is not "costly", which is
> inconsistent with how those words would normally be interpreted.
> 
> Looking at the code I see comparisons like:
> 
>    order < PAGE_ALLOC_COSTLY_ORDER
> or
>    order >= PAGE_ALLOC_COSTLY_ORDER
> 
> which supports the documented (but incoherent) meaning.
> 
> But I also see:
> 
>   order = max_t(int, PAGE_ALLOC_COSTLY_ORDER - 1, 0);

this smells fishy. Very similarly to other PAGE_ALLOC_COSTLY_ORDER usage
out of the mm proper. Many of them can be simply changed to use
kvmalloc. I will put this on my todo list for a later cleanup. There
shouldn't be any real need to care about PAGE_ALLOC_COSTLY_ORDER.

> which looks like it is trying to perform the largest non-costly
> allocation, but is making a smaller allocation than necessary.
> 
> I would *really* like it if the constant actually meant what its name
> implied.
> 
>  PAGE_ALLOC_MAX_NON_COSTLY
> ??

Yeah, I can see how this can be confusing. Maybe this is worth a
separate cleanup? I wouldn't like to conflate it with this patch.
 
> > + * !costly allocations are too essential to fail so they are implicitly
> > + * non-failing (with some exceptions like OOM victims might fail) by default while
> > + * costly requests try to be not disruptive and back off even without invoking
> > + * the OOM killer. The following three modifiers might be used to override some of
> > + * these implicit rules
> > + *
> > + * __GFP_NORETRY: The VM implementation must not retry indefinitely and will
> > + *   return NULL when direct reclaim and memory compaction have failed to allow
> > + *   the allocation to succeed.  The OOM killer is not called with the current
> > + *   implementation. This is a default mode for costly allocations.
> 
> The name here is "NORETRY", but the text says "not retry indefinitely".
> So does it retry or not?
> I would assuming it "tried" once, and only once.
> However it could be that a "try" is not a simple well defined task.

This is the case unfortunatelly. E.g. we have that node_reclaim thing
which will try to reclaim a local node before falling back to other
nodes. And that counts as a direct reclaim attempt and that happens in
the allocator fast path. We do get to the allocator slow path where we
do the full direct reclaim attempt and give up only if that fails.
Confusing? I can see how...

> Maybe some escalation happens on the 2nd or 3rd "try", so they are really
> trying different things?
> 
> The word "indefinitely" implies there is a definite limit.  It might
> help to say what that is, or at least say that it is small.

OK.
 
> Also, this documentation is phrased to tell the VM implementor what is,
> or is not, allowed.  Most readers will be more interested is the
> responsibilities of the caller.
> 
>   __GFP_NORETRY: The VM implementation will not retry after all
>      reasonable avenues for finding free memory have been pursued.  The
>      implementation may sleep (i.e. call 'schedule()'), but only while
>      waiting for another task to perform some specific action.
>      The caller must handle failure.  This flag is suitable when failure can
>      easily be handled at small cost, such as reduced throughput.

The above is not precise. What about the following?

__GFP_NORETRY: The VM implementation will not try only very lightweight
memory direct reclaim to get some memory under memory pressure (thus
it can sleep). It will avoid disruptive actions like OOM killer. The
caller must handle the failure which is quite likely to happen under
heavy memory pressure. The flag is suitable when failure can easily be
handled at small cost, such as reduced throughput


> > + *
> > + * __GFP_RETRY_MAYFAIL: Try hard to allocate the memory, but the allocation attempt
> > + *   _might_ fail. All viable forms of memory reclaim are tried before the fail.
> > + *   The OOM killer is excluded because this would be too disruptive. This can be
> > + *   used to override non-failing default behavior for !costly requests as well as
> > + *   fortify costly requests.
> 
> What does "Try hard" mean?
> In part, it means "retry everything a few more times", I guess in the
> hope that something happened in the mean time.
> It also seems to mean waiting for compaction to happen, which I
> guess is only relevant for >PAGE_SIZE allocations?
> Maybe it also means waiting for page-out to complete.
> So the summary would be that it waits for a little while, hoping for a
> miracle.
> 
>    __GFP_RETRY_MAYFAIL:  The VM implementation will retry memory reclaim
>      procedures that have previously failed if there is some indication
>      that progress has been made else where.  It can wait for other
>      tasks to attempt high level approaches to freeing memory such as
>      compaction (which removed fragmentation) and page-out.
>      There is still a definite limit to the number of retries, but it is
>      a larger limit than with __GFP_NORERY.
>      Allocations with this flag may fail, but only when there is
>      genuinely little unused memory. While these allocations do not
>      directly trigger the OOM killer, their failure indicates that the
>      system is likely to need to use the OOM killer soon.
>      The caller must handle failure, but can reasonably do so by failing
>      a higher-level request, or completing it only in a much less
>      efficient manner.
>      If the allocation does fail, and the caller is in a position to
>      free some non-essential memory, doing so could benefit the system
>      as a whole.

OK

> >   *
> >   * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
> >   *   cannot handle allocation failures. New users should be evaluated carefully
> >   *   (and the flag should be used only when there is no reasonable failure
> >   *   policy) but it is definitely preferable to use the flag rather than
> > - *   opencode endless loop around allocator.
> > - *
> > - * __GFP_NORETRY: The VM implementation must not retry indefinitely and will
> > - *   return NULL when direct reclaim and memory compaction have failed to allow
> > - *   the allocation to succeed.  The OOM killer is not called with the current
> > - *   implementation.
> > + *   opencode endless loop around allocator. Using this flag for costly allocations
> > + *   is _highly_ discouraged.
> 
> Should this explicitly say that the OOM killer might be invoked in an attempt
> to satisfy this allocation?

Well that depends. Normally it does but E.g. __GFP_NOFAIL | GFP_NOFS
will not trigger the OOM killer because we never trigger OOM killer for
NOFS requests as a lot of metadata might be pinned under the current fs
context.

> Is the OOM killer *only* invoked from
> allocations with __GFP_NOFAIL ?

No. Most !costly allocation requests with __GFP_DIRECT_RECLAIM are
allowed to trigger the OOM killer. There are some exceptions described
in  __alloc_pages_may_oom. I am not sure we want to docment those in the
high level documentation. 

> Maybe be extra explicit "The allocation could block indefinitely but
> will never return with failure.  Testing for failure is pointless.".

OK

> I've probably got several specifics wrong.  I've tried to answer the
> questions that I would like to see answered by the documentation.   If
> you can fix it up so that those questions are answered correctly, that
> would be great.

This is what I will fold into the patch:
---
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 60af7937c6f2..9c96c739d726 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -144,23 +144,40 @@ struct vm_area_struct;
  * the OOM killer. The following three modifiers might be used to override some of
  * these implicit rules
  *
- * __GFP_NORETRY: The VM implementation must not retry indefinitely and will
- *   return NULL when direct reclaim and memory compaction have failed to allow
- *   the allocation to succeed.  The OOM killer is not called with the current
- *   implementation. This is a default mode for costly allocations.
- *
- * __GFP_RETRY_MAYFAIL: Try hard to allocate the memory, but the allocation attempt
- *   _might_ fail. All viable forms of memory reclaim are tried before the fail.
- *   The OOM killer is excluded because this would be too disruptive. This can be
- *   used to override non-failing default behavior for !costly requests as well as
- *   fortify costly requests.
+ * __GFP_NORETRY: The VM implementation will not try only very lightweight
+ *   memory direct reclaim to get some memory under memory pressure (thus
+ *   it can sleep). It will avoid disruptive actions like OOM killer. The
+ *   caller must handle the failure which is quite likely to happen under
+ *   heavy memory pressure. The flag is suitable when failure can easily be
+ *   handled at small cost, such as reduced throughput
+ *
+ * __GFP_RETRY_MAYFAIL: The VM implementation will retry memory reclaim
+ *   procedures that have previously failed if there is some indication
+ *   that progress has been made else where.  It can wait for other
+ *   tasks to attempt high level approaches to freeing memory such as
+ *   compaction (which removed fragmentation) and page-out.
+ *   There is still a definite limit to the number of retries, but it is
+ *   a larger limit than with __GFP_NORERY.
+ *   Allocations with this flag may fail, but only when there is
+ *   genuinely little unused memory. While these allocations do not
+ *   directly trigger the OOM killer, their failure indicates that
+ *   the system is likely to need to use the OOM killer soon.  The
+ *   caller must handle failure, but can reasonably do so by failing
+ *   a higher-level request, or completing it only in a much less
+ *   efficient manner.
+ *   If the allocation does fail, and the caller is in a position to
+ *   free some non-essential memory, doing so could benefit the system
+ *   as a whole.
  *
  * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
- *   cannot handle allocation failures. New users should be evaluated carefully
- *   (and the flag should be used only when there is no reasonable failure
- *   policy) but it is definitely preferable to use the flag rather than
- *   opencode endless loop around allocator. Using this flag for costly allocations
- *   is _highly_ discouraged.
+ *   cannot handle allocation failures. The allocation could block
+ *   indefinitely but will never return with failure. Testing for
+ *   failure is pointless.
+ *   New users should be evaluated carefully (and the flag should be
+ *   used only when there is no reasonable failure policy) but it is
+ *   definitely preferable to use the flag rather than opencode endless
+ *   loop around allocator.
+ *   Using this flag for costly allocations is _highly_ discouraged.
  */
 #define __GFP_IO	((__force gfp_t)___GFP_IO)
 #define __GFP_FS	((__force gfp_t)___GFP_FS)
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 2/4] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic
@ 2017-05-31 11:42       ` Michal Hocko
  0 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2017-05-31 11:42 UTC (permalink / raw)
  To: NeilBrown
  Cc: linux-mm, Vlastimil Babka, Johannes Weiner, Mel Gorman,
	Andrew Morton, LKML

[I am sorry but I didn't get to this earlier]

On Thu 25-05-17 11:21:05, NeilBrown wrote:
> On Tue, Mar 07 2017, Michal Hocko wrote:
> 
> > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > index 2bfcfd33e476..60af7937c6f2 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -25,7 +25,7 @@ struct vm_area_struct;
> >  #define ___GFP_FS		0x80u
> >  #define ___GFP_COLD		0x100u
> >  #define ___GFP_NOWARN		0x200u
> > -#define ___GFP_REPEAT		0x400u
> > +#define ___GFP_RETRY_MAYFAIL		0x400u
> >  #define ___GFP_NOFAIL		0x800u
> >  #define ___GFP_NORETRY		0x1000u
> >  #define ___GFP_MEMALLOC		0x2000u
> > @@ -136,26 +136,38 @@ struct vm_area_struct;
> >   *
> >   * __GFP_RECLAIM is shorthand to allow/forbid both direct and kswapd reclaim.
> >   *
> > - * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
> > - *   _might_ fail.  This depends upon the particular VM implementation.
> > + * The default allocator behavior depends on the request size. We have a concept
> > + * of so called costly allocations (with order > PAGE_ALLOC_COSTLY_ORDER).
> 
> Boundary conditions is one of my pet peeves....
> The description here suggests that an allocation of
> "1<<PAGE_ALLOC_COSTLY_ORDER" pages is not "costly", which is
> inconsistent with how those words would normally be interpreted.
> 
> Looking at the code I see comparisons like:
> 
>    order < PAGE_ALLOC_COSTLY_ORDER
> or
>    order >= PAGE_ALLOC_COSTLY_ORDER
> 
> which supports the documented (but incoherent) meaning.
> 
> But I also see:
> 
>   order = max_t(int, PAGE_ALLOC_COSTLY_ORDER - 1, 0);

this smells fishy. Very similarly to other PAGE_ALLOC_COSTLY_ORDER usage
out of the mm proper. Many of them can be simply changed to use
kvmalloc. I will put this on my todo list for a later cleanup. There
shouldn't be any real need to care about PAGE_ALLOC_COSTLY_ORDER.

> which looks like it is trying to perform the largest non-costly
> allocation, but is making a smaller allocation than necessary.
> 
> I would *really* like it if the constant actually meant what its name
> implied.
> 
>  PAGE_ALLOC_MAX_NON_COSTLY
> ??

Yeah, I can see how this can be confusing. Maybe this is worth a
separate cleanup? I wouldn't like to conflate it with this patch.
 
> > + * !costly allocations are too essential to fail so they are implicitly
> > + * non-failing (with some exceptions like OOM victims might fail) by default while
> > + * costly requests try to be not disruptive and back off even without invoking
> > + * the OOM killer. The following three modifiers might be used to override some of
> > + * these implicit rules
> > + *
> > + * __GFP_NORETRY: The VM implementation must not retry indefinitely and will
> > + *   return NULL when direct reclaim and memory compaction have failed to allow
> > + *   the allocation to succeed.  The OOM killer is not called with the current
> > + *   implementation. This is a default mode for costly allocations.
> 
> The name here is "NORETRY", but the text says "not retry indefinitely".
> So does it retry or not?
> I would assuming it "tried" once, and only once.
> However it could be that a "try" is not a simple well defined task.

This is the case unfortunatelly. E.g. we have that node_reclaim thing
which will try to reclaim a local node before falling back to other
nodes. And that counts as a direct reclaim attempt and that happens in
the allocator fast path. We do get to the allocator slow path where we
do the full direct reclaim attempt and give up only if that fails.
Confusing? I can see how...

> Maybe some escalation happens on the 2nd or 3rd "try", so they are really
> trying different things?
> 
> The word "indefinitely" implies there is a definite limit.  It might
> help to say what that is, or at least say that it is small.

OK.
 
> Also, this documentation is phrased to tell the VM implementor what is,
> or is not, allowed.  Most readers will be more interested is the
> responsibilities of the caller.
> 
>   __GFP_NORETRY: The VM implementation will not retry after all
>      reasonable avenues for finding free memory have been pursued.  The
>      implementation may sleep (i.e. call 'schedule()'), but only while
>      waiting for another task to perform some specific action.
>      The caller must handle failure.  This flag is suitable when failure can
>      easily be handled at small cost, such as reduced throughput.

The above is not precise. What about the following?

__GFP_NORETRY: The VM implementation will not try only very lightweight
memory direct reclaim to get some memory under memory pressure (thus
it can sleep). It will avoid disruptive actions like OOM killer. The
caller must handle the failure which is quite likely to happen under
heavy memory pressure. The flag is suitable when failure can easily be
handled at small cost, such as reduced throughput


> > + *
> > + * __GFP_RETRY_MAYFAIL: Try hard to allocate the memory, but the allocation attempt
> > + *   _might_ fail. All viable forms of memory reclaim are tried before the fail.
> > + *   The OOM killer is excluded because this would be too disruptive. This can be
> > + *   used to override non-failing default behavior for !costly requests as well as
> > + *   fortify costly requests.
> 
> What does "Try hard" mean?
> In part, it means "retry everything a few more times", I guess in the
> hope that something happened in the mean time.
> It also seems to mean waiting for compaction to happen, which I
> guess is only relevant for >PAGE_SIZE allocations?
> Maybe it also means waiting for page-out to complete.
> So the summary would be that it waits for a little while, hoping for a
> miracle.
> 
>    __GFP_RETRY_MAYFAIL:  The VM implementation will retry memory reclaim
>      procedures that have previously failed if there is some indication
>      that progress has been made else where.  It can wait for other
>      tasks to attempt high level approaches to freeing memory such as
>      compaction (which removed fragmentation) and page-out.
>      There is still a definite limit to the number of retries, but it is
>      a larger limit than with __GFP_NORERY.
>      Allocations with this flag may fail, but only when there is
>      genuinely little unused memory. While these allocations do not
>      directly trigger the OOM killer, their failure indicates that the
>      system is likely to need to use the OOM killer soon.
>      The caller must handle failure, but can reasonably do so by failing
>      a higher-level request, or completing it only in a much less
>      efficient manner.
>      If the allocation does fail, and the caller is in a position to
>      free some non-essential memory, doing so could benefit the system
>      as a whole.

OK

> >   *
> >   * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
> >   *   cannot handle allocation failures. New users should be evaluated carefully
> >   *   (and the flag should be used only when there is no reasonable failure
> >   *   policy) but it is definitely preferable to use the flag rather than
> > - *   opencode endless loop around allocator.
> > - *
> > - * __GFP_NORETRY: The VM implementation must not retry indefinitely and will
> > - *   return NULL when direct reclaim and memory compaction have failed to allow
> > - *   the allocation to succeed.  The OOM killer is not called with the current
> > - *   implementation.
> > + *   opencode endless loop around allocator. Using this flag for costly allocations
> > + *   is _highly_ discouraged.
> 
> Should this explicitly say that the OOM killer might be invoked in an attempt
> to satisfy this allocation?

Well that depends. Normally it does but E.g. __GFP_NOFAIL | GFP_NOFS
will not trigger the OOM killer because we never trigger OOM killer for
NOFS requests as a lot of metadata might be pinned under the current fs
context.

> Is the OOM killer *only* invoked from
> allocations with __GFP_NOFAIL ?

No. Most !costly allocation requests with __GFP_DIRECT_RECLAIM are
allowed to trigger the OOM killer. There are some exceptions described
in  __alloc_pages_may_oom. I am not sure we want to docment those in the
high level documentation. 

> Maybe be extra explicit "The allocation could block indefinitely but
> will never return with failure.  Testing for failure is pointless.".

OK

> I've probably got several specifics wrong.  I've tried to answer the
> questions that I would like to see answered by the documentation.   If
> you can fix it up so that those questions are answered correctly, that
> would be great.

This is what I will fold into the patch:
---
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 60af7937c6f2..9c96c739d726 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -144,23 +144,40 @@ struct vm_area_struct;
  * the OOM killer. The following three modifiers might be used to override some of
  * these implicit rules
  *
- * __GFP_NORETRY: The VM implementation must not retry indefinitely and will
- *   return NULL when direct reclaim and memory compaction have failed to allow
- *   the allocation to succeed.  The OOM killer is not called with the current
- *   implementation. This is a default mode for costly allocations.
- *
- * __GFP_RETRY_MAYFAIL: Try hard to allocate the memory, but the allocation attempt
- *   _might_ fail. All viable forms of memory reclaim are tried before the fail.
- *   The OOM killer is excluded because this would be too disruptive. This can be
- *   used to override non-failing default behavior for !costly requests as well as
- *   fortify costly requests.
+ * __GFP_NORETRY: The VM implementation will not try only very lightweight
+ *   memory direct reclaim to get some memory under memory pressure (thus
+ *   it can sleep). It will avoid disruptive actions like OOM killer. The
+ *   caller must handle the failure which is quite likely to happen under
+ *   heavy memory pressure. The flag is suitable when failure can easily be
+ *   handled at small cost, such as reduced throughput
+ *
+ * __GFP_RETRY_MAYFAIL: The VM implementation will retry memory reclaim
+ *   procedures that have previously failed if there is some indication
+ *   that progress has been made else where.  It can wait for other
+ *   tasks to attempt high level approaches to freeing memory such as
+ *   compaction (which removed fragmentation) and page-out.
+ *   There is still a definite limit to the number of retries, but it is
+ *   a larger limit than with __GFP_NORERY.
+ *   Allocations with this flag may fail, but only when there is
+ *   genuinely little unused memory. While these allocations do not
+ *   directly trigger the OOM killer, their failure indicates that
+ *   the system is likely to need to use the OOM killer soon.  The
+ *   caller must handle failure, but can reasonably do so by failing
+ *   a higher-level request, or completing it only in a much less
+ *   efficient manner.
+ *   If the allocation does fail, and the caller is in a position to
+ *   free some non-essential memory, doing so could benefit the system
+ *   as a whole.
  *
  * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
- *   cannot handle allocation failures. New users should be evaluated carefully
- *   (and the flag should be used only when there is no reasonable failure
- *   policy) but it is definitely preferable to use the flag rather than
- *   opencode endless loop around allocator. Using this flag for costly allocations
- *   is _highly_ discouraged.
+ *   cannot handle allocation failures. The allocation could block
+ *   indefinitely but will never return with failure. Testing for
+ *   failure is pointless.
+ *   New users should be evaluated carefully (and the flag should be
+ *   used only when there is no reasonable failure policy) but it is
+ *   definitely preferable to use the flag rather than opencode endless
+ *   loop around allocator.
+ *   Using this flag for costly allocations is _highly_ discouraged.
  */
 #define __GFP_IO	((__force gfp_t)___GFP_IO)
 #define __GFP_FS	((__force gfp_t)___GFP_FS)
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 2/4] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic
  2017-03-07 15:48   ` Michal Hocko
  (?)
  (?)
@ 2017-06-03  2:24   ` Wei Yang
  2017-06-05  6:43       ` Michal Hocko
  -1 siblings, 1 reply; 47+ messages in thread
From: Wei Yang @ 2017-06-03  2:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Vlastimil Babka, Johannes Weiner, Mel Gorman,
	Andrew Morton, LKML, Michal Hocko

[-- Attachment #1: Type: text/plain, Size: 26771 bytes --]

Hi, Michal

Just go through your patch.

I have one question and one suggestion as below.

One suggestion:

This patch does two things to me:
1. Replace __GFP_REPEAT with __GFP_RETRY_MAYFAIL
2. Adjust the logic in page_alloc to provide the middle semantic

My suggestion is to split these two task into two patches, so that readers
could catch your fundamental logic change easily.

On Tue, Mar 07, 2017 at 04:48:41PM +0100, Michal Hocko wrote:
>From: Michal Hocko <mhocko@suse.com>
>
>__GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to
>the page allocator. This has been true but only for allocations requests
>larger than PAGE_ALLOC_COSTLY_ORDER. It has been always ignored for
>smaller sizes. This is a bit unfortunate because there is no way to
>express the same semantic for those requests and they are considered too
>important to fail so they might end up looping in the page allocator for
>ever, similarly to GFP_NOFAIL requests.
>
>Now that the whole tree has been cleaned up and accidental or misled
>usage of __GFP_REPEAT flag has been removed for !costly requests we can
>give the original flag a better name and more importantly a more useful
>semantic. Let's rename it to __GFP_RETRY_MAYFAIL which tells the user that
>the allocator would try really hard but there is no promise of a
>success. This will work independent of the order and overrides the
>default allocator behavior. Page allocator users have several levels of
>guarantee vs. cost options (take GFP_KERNEL as an example)
>- GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
>  attempt to free memory at all. The most light weight mode which even
>  doesn't kick the background reclaim. Should be used carefully because
>  it might deplete the memory and the next user might hit the more
>  aggressive reclaim
>- GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT)- optimistic
>  allocation without any attempt to free memory from the current context
>  but can wake kswapd to reclaim memory if the zone is below the low
>  watermark. Can be used from either atomic contexts or when the request
>  is a performance optimization and there is another fallback for a slow
>  path.
>- (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) - non
>  sleeping allocation with an expensive fallback so it can access some
>  portion of memory reserves. Usually used from interrupt/bh context with
>  an expensive slow path fallback.
>- GFP_KERNEL - both background and direct reclaim are allowed and the
>  _default_ page allocator behavior is used. That means that !costly
>  allocation requests are basically nofail (unless the requesting task
>  is killed by the OOM killer) and costly will fail early rather than
>  cause disruptive reclaim.
>- GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior and
>  all allocation requests fail early rather than cause disruptive
>  reclaim (one round of reclaim in this implementation). The OOM killer
>  is not invoked.
>- GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator behavior
>  and all allocation requests try really hard. The request will fail if the
>  reclaim cannot make any progress. The OOM killer won't be triggered.
>- GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
>  and all allocation requests will loop endlessly until they
>  succeed. This might be really dangerous especially for larger orders.
>
>Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL because
>they already had their semantic. No new users are added.
>__alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if
>there is no progress and we have already passed the OOM point. This
>means that all the reclaim opportunities have been exhausted except the
>most disruptive one (the OOM killer) and a user defined fallback
>behavior is more sensible than keep retrying in the page allocator.
>
>Signed-off-by: Michal Hocko <mhocko@suse.com>
>---
> Documentation/DMA-ISA-LPC.txt                |  2 +-
> arch/powerpc/include/asm/book3s/64/pgalloc.h |  2 +-
> arch/powerpc/kvm/book3s_64_mmu_hv.c          |  2 +-
> drivers/mmc/host/wbsd.c                      |  2 +-
> drivers/s390/char/vmcp.c                     |  2 +-
> drivers/target/target_core_transport.c       |  2 +-
> drivers/vhost/net.c                          |  2 +-
> drivers/vhost/scsi.c                         |  2 +-
> drivers/vhost/vsock.c                        |  2 +-
> fs/btrfs/check-integrity.c                   |  2 +-
> fs/btrfs/raid56.c                            |  2 +-
> include/linux/gfp.h                          | 32 +++++++++++++++++++---------
> include/linux/slab.h                         |  3 ++-
> include/trace/events/mmflags.h               |  2 +-
> mm/hugetlb.c                                 |  4 ++--
> mm/internal.h                                |  2 +-
> mm/page_alloc.c                              | 14 +++++++++---
> mm/sparse-vmemmap.c                          |  4 ++--
> mm/util.c                                    |  6 +++---
> mm/vmalloc.c                                 |  2 +-
> mm/vmscan.c                                  |  8 +++----
> net/core/dev.c                               |  6 +++---
> net/core/skbuff.c                            |  2 +-
> net/sched/sch_fq.c                           |  2 +-
> tools/perf/builtin-kmem.c                    |  2 +-
> 25 files changed, 66 insertions(+), 45 deletions(-)
>
>diff --git a/Documentation/DMA-ISA-LPC.txt b/Documentation/DMA-ISA-LPC.txt
>index c41331398752..7a065ac4a9d1 100644
>--- a/Documentation/DMA-ISA-LPC.txt
>+++ b/Documentation/DMA-ISA-LPC.txt
>@@ -42,7 +42,7 @@ requirements you pass the flag GFP_DMA to kmalloc.
> 
> Unfortunately the memory available for ISA DMA is scarce so unless you
> allocate the memory during boot-up it's a good idea to also pass
>-__GFP_REPEAT and __GFP_NOWARN to make the allocator try a bit harder.
>+__GFP_RETRY_MAYFAIL and __GFP_NOWARN to make the allocator try a bit harder.
> 
> (This scarcity also means that you should allocate the buffer as
> early as possible and not release it until the driver is unloaded.)
>diff --git a/arch/powerpc/include/asm/book3s/64/pgalloc.h b/arch/powerpc/include/asm/book3s/64/pgalloc.h
>index cd5e7aa8cc34..1b835aa5b4d1 100644
>--- a/arch/powerpc/include/asm/book3s/64/pgalloc.h
>+++ b/arch/powerpc/include/asm/book3s/64/pgalloc.h
>@@ -56,7 +56,7 @@ static inline pgd_t *radix__pgd_alloc(struct mm_struct *mm)
> 	return (pgd_t *)__get_free_page(PGALLOC_GFP);
> #else
> 	struct page *page;
>-	page = alloc_pages(PGALLOC_GFP | __GFP_REPEAT, 4);
>+	page = alloc_pages(PGALLOC_GFP | __GFP_RETRY_MAYFAIL, 4);
> 	if (!page)
> 		return NULL;
> 	return (pgd_t *) page_address(page);
>diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
>index 8c68145ba1bd..8ad2c309f14a 100644
>--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
>+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
>@@ -93,7 +93,7 @@ int kvmppc_allocate_hpt(struct kvm_hpt_info *info, u32 order)
> 	}
> 
> 	if (!hpt)
>-		hpt = __get_free_pages(GFP_KERNEL|__GFP_ZERO|__GFP_REPEAT
>+		hpt = __get_free_pages(GFP_KERNEL|__GFP_ZERO|__GFP_RETRY_MAYFAIL
> 				       |__GFP_NOWARN, order - PAGE_SHIFT);
> 
> 	if (!hpt)
>diff --git a/drivers/mmc/host/wbsd.c b/drivers/mmc/host/wbsd.c
>index bd04e8bae010..b58fa5b5b972 100644
>--- a/drivers/mmc/host/wbsd.c
>+++ b/drivers/mmc/host/wbsd.c
>@@ -1386,7 +1386,7 @@ static void wbsd_request_dma(struct wbsd_host *host, int dma)
> 	 * order for ISA to be able to DMA to it.
> 	 */
> 	host->dma_buffer = kmalloc(WBSD_DMA_SIZE,
>-		GFP_NOIO | GFP_DMA | __GFP_REPEAT | __GFP_NOWARN);
>+		GFP_NOIO | GFP_DMA | __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
> 	if (!host->dma_buffer)
> 		goto free;
> 
>diff --git a/drivers/s390/char/vmcp.c b/drivers/s390/char/vmcp.c
>index 65f5a794f26d..98749fa817da 100644
>--- a/drivers/s390/char/vmcp.c
>+++ b/drivers/s390/char/vmcp.c
>@@ -98,7 +98,7 @@ vmcp_write(struct file *file, const char __user *buff, size_t count,
> 	}
> 	if (!session->response)
> 		session->response = (char *)__get_free_pages(GFP_KERNEL
>-						| __GFP_REPEAT | GFP_DMA,
>+						| __GFP_RETRY_MAYFAIL | GFP_DMA,
> 						get_order(session->bufsize));
> 	if (!session->response) {
> 		mutex_unlock(&session->mutex);
>diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c
>index 434d9d693989..e585d301c665 100644
>--- a/drivers/target/target_core_transport.c
>+++ b/drivers/target/target_core_transport.c
>@@ -251,7 +251,7 @@ int transport_alloc_session_tags(struct se_session *se_sess,
> 	int rc;
> 
> 	se_sess->sess_cmd_map = kzalloc(tag_num * tag_size,
>-					GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
>+					GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_MAYFAIL);
> 	if (!se_sess->sess_cmd_map) {
> 		se_sess->sess_cmd_map = vzalloc(tag_num * tag_size);
> 		if (!se_sess->sess_cmd_map) {
>diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>index f61f852d6cfd..7d2c4ce6d8d1 100644
>--- a/drivers/vhost/net.c
>+++ b/drivers/vhost/net.c
>@@ -817,7 +817,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
> 	struct vhost_virtqueue **vqs;
> 	int i;
> 
>-	n = kvmalloc(sizeof *n, GFP_KERNEL | __GFP_REPEAT);
>+	n = kvmalloc(sizeof *n, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
> 	if (!n)
> 		return -ENOMEM;
> 	vqs = kmalloc(VHOST_NET_VQ_MAX * sizeof(*vqs), GFP_KERNEL);
>diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
>index fd6c8b66f06f..ff02a942c4d5 100644
>--- a/drivers/vhost/scsi.c
>+++ b/drivers/vhost/scsi.c
>@@ -1404,7 +1404,7 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
> 	struct vhost_virtqueue **vqs;
> 	int r = -ENOMEM, i;
> 
>-	vs = kzalloc(sizeof(*vs), GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
>+	vs = kzalloc(sizeof(*vs), GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_MAYFAIL);
> 	if (!vs) {
> 		vs = vzalloc(sizeof(*vs));
> 		if (!vs)
>diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>index d403c647ba56..5b76242d73e3 100644
>--- a/drivers/vhost/vsock.c
>+++ b/drivers/vhost/vsock.c
>@@ -460,7 +460,7 @@ static int vhost_vsock_dev_open(struct inode *inode, struct file *file)
> 	/* This struct is large and allocation could fail, fall back to vmalloc
> 	 * if there is no other way.
> 	 */
>-	vsock = kvmalloc(sizeof(*vsock), GFP_KERNEL | __GFP_REPEAT);
>+	vsock = kvmalloc(sizeof(*vsock), GFP_KERNEL | __GFP_RETRY_MAYFAIL);
> 	if (!vsock)
> 		return -ENOMEM;
> 
>diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
>index ab14c2e635ca..e334ed2b7e64 100644
>--- a/fs/btrfs/check-integrity.c
>+++ b/fs/btrfs/check-integrity.c
>@@ -2923,7 +2923,7 @@ int btrfsic_mount(struct btrfs_fs_info *fs_info,
> 		       fs_info->sectorsize, PAGE_SIZE);
> 		return -1;
> 	}
>-	state = kzalloc(sizeof(*state), GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
>+	state = kzalloc(sizeof(*state), GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_MAYFAIL);
> 	if (!state) {
> 		state = vzalloc(sizeof(*state));
> 		if (!state) {
>diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
>index 1571bf26dc07..94af3db1d0e4 100644
>--- a/fs/btrfs/raid56.c
>+++ b/fs/btrfs/raid56.c
>@@ -218,7 +218,7 @@ int btrfs_alloc_stripe_hash_table(struct btrfs_fs_info *info)
> 	 * of a failing mount.
> 	 */
> 	table_size = sizeof(*table) + sizeof(*h) * num_entries;
>-	table = kzalloc(table_size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
>+	table = kzalloc(table_size, GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_MAYFAIL);
> 	if (!table) {
> 		table = vzalloc(table_size);
> 		if (!table)
>diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>index 2bfcfd33e476..60af7937c6f2 100644
>--- a/include/linux/gfp.h
>+++ b/include/linux/gfp.h
>@@ -25,7 +25,7 @@ struct vm_area_struct;
> #define ___GFP_FS		0x80u
> #define ___GFP_COLD		0x100u
> #define ___GFP_NOWARN		0x200u
>-#define ___GFP_REPEAT		0x400u
>+#define ___GFP_RETRY_MAYFAIL		0x400u
> #define ___GFP_NOFAIL		0x800u
> #define ___GFP_NORETRY		0x1000u
> #define ___GFP_MEMALLOC		0x2000u
>@@ -136,26 +136,38 @@ struct vm_area_struct;
>  *
>  * __GFP_RECLAIM is shorthand to allow/forbid both direct and kswapd reclaim.
>  *
>- * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
>- *   _might_ fail.  This depends upon the particular VM implementation.
>+ * The default allocator behavior depends on the request size. We have a concept
>+ * of so called costly allocations (with order > PAGE_ALLOC_COSTLY_ORDER).
>+ * !costly allocations are too essential to fail so they are implicitly
>+ * non-failing (with some exceptions like OOM victims might fail) by default while
>+ * costly requests try to be not disruptive and back off even without invoking
>+ * the OOM killer. The following three modifiers might be used to override some of
>+ * these implicit rules
>+ *
>+ * __GFP_NORETRY: The VM implementation must not retry indefinitely and will
>+ *   return NULL when direct reclaim and memory compaction have failed to allow
>+ *   the allocation to succeed.  The OOM killer is not called with the current
>+ *   implementation. This is a default mode for costly allocations.
>+ *
>+ * __GFP_RETRY_MAYFAIL: Try hard to allocate the memory, but the allocation attempt
>+ *   _might_ fail. All viable forms of memory reclaim are tried before the fail.
>+ *   The OOM killer is excluded because this would be too disruptive. This can be
>+ *   used to override non-failing default behavior for !costly requests as well as
>+ *   fortify costly requests.
>  *
>  * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
>  *   cannot handle allocation failures. New users should be evaluated carefully
>  *   (and the flag should be used only when there is no reasonable failure
>  *   policy) but it is definitely preferable to use the flag rather than
>- *   opencode endless loop around allocator.
>- *
>- * __GFP_NORETRY: The VM implementation must not retry indefinitely and will
>- *   return NULL when direct reclaim and memory compaction have failed to allow
>- *   the allocation to succeed.  The OOM killer is not called with the current
>- *   implementation.
>+ *   opencode endless loop around allocator. Using this flag for costly allocations
>+ *   is _highly_ discouraged.
>  */
> #define __GFP_IO	((__force gfp_t)___GFP_IO)
> #define __GFP_FS	((__force gfp_t)___GFP_FS)
> #define __GFP_DIRECT_RECLAIM	((__force gfp_t)___GFP_DIRECT_RECLAIM) /* Caller can reclaim */
> #define __GFP_KSWAPD_RECLAIM	((__force gfp_t)___GFP_KSWAPD_RECLAIM) /* kswapd can wake */
> #define __GFP_RECLAIM ((__force gfp_t)(___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM))
>-#define __GFP_REPEAT	((__force gfp_t)___GFP_REPEAT)
>+#define __GFP_RETRY_MAYFAIL	((__force gfp_t)___GFP_RETRY_MAYFAIL)
> #define __GFP_NOFAIL	((__force gfp_t)___GFP_NOFAIL)
> #define __GFP_NORETRY	((__force gfp_t)___GFP_NORETRY)
> 
>diff --git a/include/linux/slab.h b/include/linux/slab.h
>index 3c37a8c51921..064901ac3e37 100644
>--- a/include/linux/slab.h
>+++ b/include/linux/slab.h
>@@ -469,7 +469,8 @@ static __always_inline void *kmalloc_large(size_t size, gfp_t flags)
>  *
>  * %__GFP_NOWARN - If allocation fails, don't issue any warnings.
>  *
>- * %__GFP_REPEAT - If allocation fails initially, try once more before failing.
>+ * %__GFP_RETRY_MAYFAIL - Try really hard to succeed the allocation but fail
>+ *   eventually.
>  *
>  * There are other flags available as well, but these are not intended
>  * for general use, and so are not documented here. For a full list of
>diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
>index 304ff94363b2..418142a4efce 100644
>--- a/include/trace/events/mmflags.h
>+++ b/include/trace/events/mmflags.h
>@@ -34,7 +34,7 @@
> 	{(unsigned long)__GFP_FS,		"__GFP_FS"},		\
> 	{(unsigned long)__GFP_COLD,		"__GFP_COLD"},		\
> 	{(unsigned long)__GFP_NOWARN,		"__GFP_NOWARN"},	\
>-	{(unsigned long)__GFP_REPEAT,		"__GFP_REPEAT"},	\
>+	{(unsigned long)__GFP_RETRY_MAYFAIL,	"__GFP_RETRY_MAYFAIL"},	\
> 	{(unsigned long)__GFP_NOFAIL,		"__GFP_NOFAIL"},	\
> 	{(unsigned long)__GFP_NORETRY,		"__GFP_NORETRY"},	\
> 	{(unsigned long)__GFP_COMP,		"__GFP_COMP"},		\
>diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>index a7aa811b7d14..dc598bfe4ce9 100644
>--- a/mm/hugetlb.c
>+++ b/mm/hugetlb.c
>@@ -1369,7 +1369,7 @@ static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
> 
> 	page = __alloc_pages_node(nid,
> 		htlb_alloc_mask(h)|__GFP_COMP|__GFP_THISNODE|
>-						__GFP_REPEAT|__GFP_NOWARN,
>+						__GFP_RETRY_MAYFAIL|__GFP_NOWARN,
> 		huge_page_order(h));
> 	if (page) {
> 		prep_new_huge_page(h, page, nid);
>@@ -1510,7 +1510,7 @@ static struct page *__hugetlb_alloc_buddy_huge_page(struct hstate *h,
> 		struct vm_area_struct *vma, unsigned long addr, int nid)
> {
> 	int order = huge_page_order(h);
>-	gfp_t gfp = htlb_alloc_mask(h)|__GFP_COMP|__GFP_REPEAT|__GFP_NOWARN;
>+	gfp_t gfp = htlb_alloc_mask(h)|__GFP_COMP|__GFP_RETRY_MAYFAIL|__GFP_NOWARN;
> 	unsigned int cpuset_mems_cookie;
> 
> 	/*
>diff --git a/mm/internal.h b/mm/internal.h
>index 823a7a89099b..8e6d347f70fb 100644
>--- a/mm/internal.h
>+++ b/mm/internal.h
>@@ -23,7 +23,7 @@
>  * hints such as HIGHMEM usage.
>  */
> #define GFP_RECLAIM_MASK (__GFP_RECLAIM|__GFP_HIGH|__GFP_IO|__GFP_FS|\
>-			__GFP_NOWARN|__GFP_REPEAT|__GFP_NOFAIL|\
>+			__GFP_NOWARN|__GFP_RETRY_MAYFAIL|__GFP_NOFAIL|\
> 			__GFP_NORETRY|__GFP_MEMALLOC|__GFP_NOMEMALLOC|\
> 			__GFP_ATOMIC)
> 
>diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>index 5238b87aec91..bfe4a9bad0f8 100644
>--- a/mm/page_alloc.c
>+++ b/mm/page_alloc.c
>@@ -3181,6 +3181,14 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> 	/* The OOM killer will not help higher order allocs */
> 	if (order > PAGE_ALLOC_COSTLY_ORDER)
> 		goto out;
>+	/*
>+	 * We have already exhausted all our reclaim opportunities without any
>+	 * success so it is time to admit defeat. We will skip the OOM killer
>+	 * because it is very likely that the caller has a more reasonable
>+	 * fallback than shooting a random task.
>+	 */
>+	if (gfp_mask & __GFP_RETRY_MAYFAIL)
>+		goto out;
> 	/* The OOM killer does not needlessly kill tasks for lowmem */
> 	if (ac->high_zoneidx < ZONE_NORMAL)
> 		goto out;
>@@ -3309,7 +3317,7 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
> 	}
> 
> 	/*
>-	 * !costly requests are much more important than __GFP_REPEAT
>+	 * !costly requests are much more important than __GFP_RETRY_MAYFAIL
> 	 * costly ones because they are de facto nofail and invoke OOM
> 	 * killer to move on while costly can fail and users are ready
> 	 * to cope with that. 1/4 retries is rather arbitrary but we
>@@ -3776,9 +3784,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> 
> 	/*
> 	 * Do not retry costly high order allocations unless they are
>-	 * __GFP_REPEAT
>+	 * __GFP_RETRY_MAYFAIL
> 	 */
>-	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
>+	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_RETRY_MAYFAIL))
> 		goto nopage;

One question:

From your change log, it mentions will provide the same semantic for !costly
allocations. While the logic here is the same as before.

For a !costly allocation with __GFP_REPEAT flag, the difference after this
patch is no OOM will be invoked, while it will still continue in the loop.

Maybe I don't catch your point in this message:

  __GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to
  the page allocator. This has been true but only for allocations requests
  larger than PAGE_ALLOC_COSTLY_ORDER. It has been always ignored for
  smaller sizes. This is a bit unfortunate because there is no way to
  express the same semantic for those requests and they are considered too
  important to fail so they might end up looping in the page allocator for
  ever, similarly to GFP_NOFAIL requests.

I thought you will provide the same semantic to !costly allocation, or I
misunderstand?

> 
> 	if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
>diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
>index 574c67b663fe..b21ba0dfe102 100644
>--- a/mm/sparse-vmemmap.c
>+++ b/mm/sparse-vmemmap.c
>@@ -56,11 +56,11 @@ void * __meminit vmemmap_alloc_block(unsigned long size, int node)
> 
> 		if (node_state(node, N_HIGH_MEMORY))
> 			page = alloc_pages_node(
>-				node, GFP_KERNEL | __GFP_ZERO | __GFP_REPEAT,
>+				node, GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL,
> 				get_order(size));
> 		else
> 			page = alloc_pages(
>-				GFP_KERNEL | __GFP_ZERO | __GFP_REPEAT,
>+				GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL,
> 				get_order(size));
> 		if (page)
> 			return page_address(page);
>diff --git a/mm/util.c b/mm/util.c
>index 6ed3e49bf1e5..885a78d1941b 100644
>--- a/mm/util.c
>+++ b/mm/util.c
>@@ -339,7 +339,7 @@ EXPORT_SYMBOL(vm_mmap);
>  * Uses kmalloc to get the memory but if the allocation fails then falls back
>  * to the vmalloc allocator. Use kvfree for freeing the memory.
>  *
>- * Reclaim modifiers - __GFP_NORETRY and __GFP_NOFAIL are not supported. __GFP_REPEAT
>+ * Reclaim modifiers - __GFP_NORETRY and __GFP_NOFAIL are not supported. __GFP_RETRY_MAYFAIL
>  * is supported only for large (>32kB) allocations, and it should be used only if
>  * kmalloc is preferable to the vmalloc fallback, due to visible performance drawbacks.
>  *
>@@ -364,11 +364,11 @@ void *kvmalloc_node(size_t size, gfp_t flags, int node)
> 		kmalloc_flags |= __GFP_NOWARN;
> 
> 		/*
>-		 * We have to override __GFP_REPEAT by __GFP_NORETRY for !costly
>+		 * We have to override __GFP_RETRY_MAYFAIL by __GFP_NORETRY for !costly
> 		 * requests because there is no other way to tell the allocator
> 		 * that we want to fail rather than retry endlessly.
> 		 */
>-		if (!(kmalloc_flags & __GFP_REPEAT) ||
>+		if (!(kmalloc_flags & __GFP_RETRY_MAYFAIL) ||
> 				(size <= PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
> 			kmalloc_flags |= __GFP_NORETRY;
> 	}
>diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>index 32979d945766..c2fa2e1b79fc 100644
>--- a/mm/vmalloc.c
>+++ b/mm/vmalloc.c
>@@ -1747,7 +1747,7 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
>  *	allocator with @gfp_mask flags.  Map them into contiguous
>  *	kernel virtual space, using a pagetable protection of @prot.
>  *
>- *	Reclaim modifiers in @gfp_mask - __GFP_NORETRY, __GFP_REPEAT
>+ *	Reclaim modifiers in @gfp_mask - __GFP_NORETRY, __GFP_RETRY_MAYFAIL
>  *	and __GFP_NOFAIL are not supported
>  *
>  *	Any use of gfp flags outside of GFP_KERNEL should be consulted
>diff --git a/mm/vmscan.c b/mm/vmscan.c
>index 4e0a828781e5..8f547176e02c 100644
>--- a/mm/vmscan.c
>+++ b/mm/vmscan.c
>@@ -2435,18 +2435,18 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
> 		return false;
> 
> 	/* Consider stopping depending on scan and reclaim activity */
>-	if (sc->gfp_mask & __GFP_REPEAT) {
>+	if (sc->gfp_mask & __GFP_RETRY_MAYFAIL) {
> 		/*
>-		 * For __GFP_REPEAT allocations, stop reclaiming if the
>+		 * For __GFP_RETRY_MAYFAIL allocations, stop reclaiming if the
> 		 * full LRU list has been scanned and we are still failing
> 		 * to reclaim pages. This full LRU scan is potentially
>-		 * expensive but a __GFP_REPEAT caller really wants to succeed
>+		 * expensive but a __GFP_RETRY_MAYFAIL caller really wants to succeed
> 		 */
> 		if (!nr_reclaimed && !nr_scanned)
> 			return false;
> 	} else {
> 		/*
>-		 * For non-__GFP_REPEAT allocations which can presumably
>+		 * For non-__GFP_RETRY_MAYFAIL allocations which can presumably
> 		 * fail without consequence, stop if we failed to reclaim
> 		 * any pages from the last SWAP_CLUSTER_MAX number of
> 		 * pages that were scanned. This will return to the
>diff --git a/net/core/dev.c b/net/core/dev.c
>index d947308ee255..3e659ac9e0ed 100644
>--- a/net/core/dev.c
>+++ b/net/core/dev.c
>@@ -7121,7 +7121,7 @@ static int netif_alloc_rx_queues(struct net_device *dev)
> 
> 	BUG_ON(count < 1);
> 
>-	rx = kvzalloc(sz, GFP_KERNEL | __GFP_REPEAT);
>+	rx = kvzalloc(sz, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
> 	if (!rx)
> 		return -ENOMEM;
> 
>@@ -7161,7 +7161,7 @@ static int netif_alloc_netdev_queues(struct net_device *dev)
> 	if (count < 1 || count > 0xffff)
> 		return -EINVAL;
> 
>-	tx = kvzalloc(sz, GFP_KERNEL | __GFP_REPEAT);
>+	tx = kvzalloc(sz, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
> 	if (!tx)
> 		return -ENOMEM;
> 
>@@ -7698,7 +7698,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
> 	/* ensure 32-byte alignment of whole construct */
> 	alloc_size += NETDEV_ALIGN - 1;
> 
>-	p = kvzalloc(alloc_size, GFP_KERNEL | __GFP_REPEAT);
>+	p = kvzalloc(alloc_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
> 	if (!p)
> 		return NULL;
> 
>diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>index 9ccba86fa23d..26af038e27f0 100644
>--- a/net/core/skbuff.c
>+++ b/net/core/skbuff.c
>@@ -4653,7 +4653,7 @@ struct sk_buff *alloc_skb_with_frags(unsigned long header_len,
> 
> 	gfp_head = gfp_mask;
> 	if (gfp_head & __GFP_DIRECT_RECLAIM)
>-		gfp_head |= __GFP_REPEAT;
>+		gfp_head |= __GFP_RETRY_MAYFAIL;
> 
> 	*errcode = -ENOBUFS;
> 	skb = alloc_skb(header_len, gfp_head);
>diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
>index 594f77d89f6c..daebe062a2dd 100644
>--- a/net/sched/sch_fq.c
>+++ b/net/sched/sch_fq.c
>@@ -640,7 +640,7 @@ static int fq_resize(struct Qdisc *sch, u32 log)
> 		return 0;
> 
> 	/* If XPS was setup, we can allocate memory on right NUMA node */
>-	array = kvmalloc_node(sizeof(struct rb_root) << log, GFP_KERNEL | __GFP_REPEAT,
>+	array = kvmalloc_node(sizeof(struct rb_root) << log, GFP_KERNEL | __GFP_RETRY_MAYFAIL,
> 			      netdev_queue_numa_node_read(sch->dev_queue));
> 	if (!array)
> 		return -ENOMEM;
>diff --git a/tools/perf/builtin-kmem.c b/tools/perf/builtin-kmem.c
>index 6da8d083e4e5..01ca903fcdb9 100644
>--- a/tools/perf/builtin-kmem.c
>+++ b/tools/perf/builtin-kmem.c
>@@ -638,7 +638,7 @@ static const struct {
> 	{ "__GFP_FS",			"F" },
> 	{ "__GFP_COLD",			"CO" },
> 	{ "__GFP_NOWARN",		"NWR" },
>-	{ "__GFP_REPEAT",		"R" },
>+	{ "__GFP_RETRY_MAYFAIL",	"R" },
> 	{ "__GFP_NOFAIL",		"NF" },
> 	{ "__GFP_NORETRY",		"NR" },
> 	{ "__GFP_COMP",			"C" },
>-- 
>2.11.0

-- 
Wei Yang
Help you, Help me

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 2/4] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic
  2017-06-03  2:24   ` Wei Yang
@ 2017-06-05  6:43       ` Michal Hocko
  0 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2017-06-05  6:43 UTC (permalink / raw)
  To: Wei Yang
  Cc: linux-mm, Vlastimil Babka, Johannes Weiner, Mel Gorman,
	Andrew Morton, LKML

On Sat 03-06-17 10:24:40, Wei Yang wrote:
> Hi, Michal
> 
> Just go through your patch.
> 
> I have one question and one suggestion as below.
> 
> One suggestion:
> 
> This patch does two things to me:
> 1. Replace __GFP_REPEAT with __GFP_RETRY_MAYFAIL
> 2. Adjust the logic in page_alloc to provide the middle semantic
> 
> My suggestion is to split these two task into two patches, so that readers
> could catch your fundamental logic change easily.

Well, the rename and the change is intentionally tight together. My
previous patches have removed all __GFP_REPEAT users for low order
requests which didn't have any implemented semantic. So as of now we
should only have those users which semantic will not change. I do not
add any new low order user in this patch so it in fact doesn't change
any existing semnatic.

> 
> On Tue, Mar 07, 2017 at 04:48:41PM +0100, Michal Hocko wrote:
> >From: Michal Hocko <mhocko@suse.com>
[...]
> >@@ -3776,9 +3784,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > 
> > 	/*
> > 	 * Do not retry costly high order allocations unless they are
> >-	 * __GFP_REPEAT
> >+	 * __GFP_RETRY_MAYFAIL
> > 	 */
> >-	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
> >+	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_RETRY_MAYFAIL))
> > 		goto nopage;
> 
> One question:
> 
> From your change log, it mentions will provide the same semantic for !costly
> allocations. While the logic here is the same as before.
> 
> For a !costly allocation with __GFP_REPEAT flag, the difference after this
> patch is no OOM will be invoked, while it will still continue in the loop.

Not really. There are two things. The above will shortcut retrying if
there is _no_ __GFP_RETRY_MAYFAIL. If the flags _is_ specified we will
back of in __alloc_pages_may_oom.
 
> Maybe I don't catch your point in this message:
> 
>   __GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to
>   the page allocator. This has been true but only for allocations requests
>   larger than PAGE_ALLOC_COSTLY_ORDER. It has been always ignored for
>   smaller sizes. This is a bit unfortunate because there is no way to
>   express the same semantic for those requests and they are considered too
>   important to fail so they might end up looping in the page allocator for
>   ever, similarly to GFP_NOFAIL requests.
> 
> I thought you will provide the same semantic to !costly allocation, or I
> misunderstand?

yes and that is the case. __alloc_pages_may_oom will back off before OOM
killer is invoked and the allocator slow path will fail because
did_some_progress == 0;
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 2/4] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic
@ 2017-06-05  6:43       ` Michal Hocko
  0 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2017-06-05  6:43 UTC (permalink / raw)
  To: Wei Yang
  Cc: linux-mm, Vlastimil Babka, Johannes Weiner, Mel Gorman,
	Andrew Morton, LKML

On Sat 03-06-17 10:24:40, Wei Yang wrote:
> Hi, Michal
> 
> Just go through your patch.
> 
> I have one question and one suggestion as below.
> 
> One suggestion:
> 
> This patch does two things to me:
> 1. Replace __GFP_REPEAT with __GFP_RETRY_MAYFAIL
> 2. Adjust the logic in page_alloc to provide the middle semantic
> 
> My suggestion is to split these two task into two patches, so that readers
> could catch your fundamental logic change easily.

Well, the rename and the change is intentionally tight together. My
previous patches have removed all __GFP_REPEAT users for low order
requests which didn't have any implemented semantic. So as of now we
should only have those users which semantic will not change. I do not
add any new low order user in this patch so it in fact doesn't change
any existing semnatic.

> 
> On Tue, Mar 07, 2017 at 04:48:41PM +0100, Michal Hocko wrote:
> >From: Michal Hocko <mhocko@suse.com>
[...]
> >@@ -3776,9 +3784,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > 
> > 	/*
> > 	 * Do not retry costly high order allocations unless they are
> >-	 * __GFP_REPEAT
> >+	 * __GFP_RETRY_MAYFAIL
> > 	 */
> >-	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
> >+	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_RETRY_MAYFAIL))
> > 		goto nopage;
> 
> One question:
> 
> From your change log, it mentions will provide the same semantic for !costly
> allocations. While the logic here is the same as before.
> 
> For a !costly allocation with __GFP_REPEAT flag, the difference after this
> patch is no OOM will be invoked, while it will still continue in the loop.

Not really. There are two things. The above will shortcut retrying if
there is _no_ __GFP_RETRY_MAYFAIL. If the flags _is_ specified we will
back of in __alloc_pages_may_oom.
 
> Maybe I don't catch your point in this message:
> 
>   __GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to
>   the page allocator. This has been true but only for allocations requests
>   larger than PAGE_ALLOC_COSTLY_ORDER. It has been always ignored for
>   smaller sizes. This is a bit unfortunate because there is no way to
>   express the same semantic for those requests and they are considered too
>   important to fail so they might end up looping in the page allocator for
>   ever, similarly to GFP_NOFAIL requests.
> 
> I thought you will provide the same semantic to !costly allocation, or I
> misunderstand?

yes and that is the case. __alloc_pages_may_oom will back off before OOM
killer is invoked and the allocator slow path will fail because
did_some_progress == 0;
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 2/4] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic
  2017-06-05  6:43       ` Michal Hocko
  (?)
@ 2017-06-06  3:04       ` Wei Yang
  2017-06-06 12:03           ` Michal Hocko
  -1 siblings, 1 reply; 47+ messages in thread
From: Wei Yang @ 2017-06-06  3:04 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Wei Yang, linux-mm, Vlastimil Babka, Johannes Weiner, Mel Gorman,
	Andrew Morton, LKML

[-- Attachment #1: Type: text/plain, Size: 3291 bytes --]

On Mon, Jun 05, 2017 at 08:43:43AM +0200, Michal Hocko wrote:
>On Sat 03-06-17 10:24:40, Wei Yang wrote:
>> Hi, Michal
>> 
>> Just go through your patch.
>> 
>> I have one question and one suggestion as below.
>> 
>> One suggestion:
>> 
>> This patch does two things to me:
>> 1. Replace __GFP_REPEAT with __GFP_RETRY_MAYFAIL
>> 2. Adjust the logic in page_alloc to provide the middle semantic
>> 
>> My suggestion is to split these two task into two patches, so that readers
>> could catch your fundamental logic change easily.
>
>Well, the rename and the change is intentionally tight together. My
>previous patches have removed all __GFP_REPEAT users for low order
>requests which didn't have any implemented semantic. So as of now we
>should only have those users which semantic will not change. I do not
>add any new low order user in this patch so it in fact doesn't change
>any existing semnatic.
>
>> 
>> On Tue, Mar 07, 2017 at 04:48:41PM +0100, Michal Hocko wrote:
>> >From: Michal Hocko <mhocko@suse.com>
>[...]
>> >@@ -3776,9 +3784,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>> > 
>> > 	/*
>> > 	 * Do not retry costly high order allocations unless they are
>> >-	 * __GFP_REPEAT
>> >+	 * __GFP_RETRY_MAYFAIL
>> > 	 */
>> >-	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
>> >+	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_RETRY_MAYFAIL))
>> > 		goto nopage;
>> 
>> One question:
>> 
>> From your change log, it mentions will provide the same semantic for !costly
>> allocations. While the logic here is the same as before.
>> 
>> For a !costly allocation with __GFP_REPEAT flag, the difference after this
>> patch is no OOM will be invoked, while it will still continue in the loop.
>
>Not really. There are two things. The above will shortcut retrying if
>there is _no_ __GFP_RETRY_MAYFAIL. If the flags _is_ specified we will
>back of in __alloc_pages_may_oom.
> 
>> Maybe I don't catch your point in this message:
>> 
>>   __GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to
>>   the page allocator. This has been true but only for allocations requests
>>   larger than PAGE_ALLOC_COSTLY_ORDER. It has been always ignored for
>>   smaller sizes. This is a bit unfortunate because there is no way to
>>   express the same semantic for those requests and they are considered too
>>   important to fail so they might end up looping in the page allocator for
>>   ever, similarly to GFP_NOFAIL requests.
>> 
>> I thought you will provide the same semantic to !costly allocation, or I
>> misunderstand?
>
>yes and that is the case. __alloc_pages_may_oom will back off before OOM
>killer is invoked and the allocator slow path will fail because
>did_some_progress == 0;

Thanks for your explanation.

So same "semantic" doesn't mean same "behavior".
1. costly allocations will pick up the shut cut
2. !costly allocations will try something more but finally fail without
invoking OOM.

Hope this time I catch your point.

BTW, did_some_progress mostly means the OOM works to me. Are there some other
important situations when did_some_progress is set to 1?

>-- 
>Michal Hocko
>SUSE Labs

-- 
Wei Yang
Help you, Help me

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 2/4] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic
  2017-06-06  3:04       ` Wei Yang
@ 2017-06-06 12:03           ` Michal Hocko
  0 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2017-06-06 12:03 UTC (permalink / raw)
  To: Wei Yang
  Cc: linux-mm, Vlastimil Babka, Johannes Weiner, Mel Gorman,
	Andrew Morton, LKML

On Tue 06-06-17 11:04:01, Wei Yang wrote:
> On Mon, Jun 05, 2017 at 08:43:43AM +0200, Michal Hocko wrote:
> >On Sat 03-06-17 10:24:40, Wei Yang wrote:
> >> Hi, Michal
> >> 
> >> Just go through your patch.
> >> 
> >> I have one question and one suggestion as below.
> >> 
> >> One suggestion:
> >> 
> >> This patch does two things to me:
> >> 1. Replace __GFP_REPEAT with __GFP_RETRY_MAYFAIL
> >> 2. Adjust the logic in page_alloc to provide the middle semantic
> >> 
> >> My suggestion is to split these two task into two patches, so that readers
> >> could catch your fundamental logic change easily.
> >
> >Well, the rename and the change is intentionally tight together. My
> >previous patches have removed all __GFP_REPEAT users for low order
> >requests which didn't have any implemented semantic. So as of now we
> >should only have those users which semantic will not change. I do not
> >add any new low order user in this patch so it in fact doesn't change
> >any existing semnatic.
> >
> >> 
> >> On Tue, Mar 07, 2017 at 04:48:41PM +0100, Michal Hocko wrote:
> >> >From: Michal Hocko <mhocko@suse.com>
> >[...]
> >> >@@ -3776,9 +3784,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >> > 
> >> > 	/*
> >> > 	 * Do not retry costly high order allocations unless they are
> >> >-	 * __GFP_REPEAT
> >> >+	 * __GFP_RETRY_MAYFAIL
> >> > 	 */
> >> >-	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
> >> >+	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_RETRY_MAYFAIL))
> >> > 		goto nopage;
> >> 
> >> One question:
> >> 
> >> From your change log, it mentions will provide the same semantic for !costly
> >> allocations. While the logic here is the same as before.
> >> 
> >> For a !costly allocation with __GFP_REPEAT flag, the difference after this
> >> patch is no OOM will be invoked, while it will still continue in the loop.
> >
> >Not really. There are two things. The above will shortcut retrying if
> >there is _no_ __GFP_RETRY_MAYFAIL. If the flags _is_ specified we will
> >back of in __alloc_pages_may_oom.
> > 
> >> Maybe I don't catch your point in this message:
> >> 
> >>   __GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to
> >>   the page allocator. This has been true but only for allocations requests
> >>   larger than PAGE_ALLOC_COSTLY_ORDER. It has been always ignored for
> >>   smaller sizes. This is a bit unfortunate because there is no way to
> >>   express the same semantic for those requests and they are considered too
> >>   important to fail so they might end up looping in the page allocator for
> >>   ever, similarly to GFP_NOFAIL requests.
> >> 
> >> I thought you will provide the same semantic to !costly allocation, or I
> >> misunderstand?
> >
> >yes and that is the case. __alloc_pages_may_oom will back off before OOM
> >killer is invoked and the allocator slow path will fail because
> >did_some_progress == 0;
> 
> Thanks for your explanation.
> 
> So same "semantic" doesn't mean same "behavior".
> 1. costly allocations will pick up the shut cut

yes and there are no such allocations yet (based on my previous
cleanups)

> 2. !costly allocations will try something more but finally fail without
> invoking OOM.

no, the behavior will not change for those.
 
> Hope this time I catch your point.
> 
> BTW, did_some_progress mostly means the OOM works to me. Are there some other
> important situations when did_some_progress is set to 1?

Yes e.g. for GFP_NOFS when we cannot really invoke the OOM killer yet we
cannot fail the allocation.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 2/4] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic
@ 2017-06-06 12:03           ` Michal Hocko
  0 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2017-06-06 12:03 UTC (permalink / raw)
  To: Wei Yang
  Cc: linux-mm, Vlastimil Babka, Johannes Weiner, Mel Gorman,
	Andrew Morton, LKML

On Tue 06-06-17 11:04:01, Wei Yang wrote:
> On Mon, Jun 05, 2017 at 08:43:43AM +0200, Michal Hocko wrote:
> >On Sat 03-06-17 10:24:40, Wei Yang wrote:
> >> Hi, Michal
> >> 
> >> Just go through your patch.
> >> 
> >> I have one question and one suggestion as below.
> >> 
> >> One suggestion:
> >> 
> >> This patch does two things to me:
> >> 1. Replace __GFP_REPEAT with __GFP_RETRY_MAYFAIL
> >> 2. Adjust the logic in page_alloc to provide the middle semantic
> >> 
> >> My suggestion is to split these two task into two patches, so that readers
> >> could catch your fundamental logic change easily.
> >
> >Well, the rename and the change is intentionally tight together. My
> >previous patches have removed all __GFP_REPEAT users for low order
> >requests which didn't have any implemented semantic. So as of now we
> >should only have those users which semantic will not change. I do not
> >add any new low order user in this patch so it in fact doesn't change
> >any existing semnatic.
> >
> >> 
> >> On Tue, Mar 07, 2017 at 04:48:41PM +0100, Michal Hocko wrote:
> >> >From: Michal Hocko <mhocko@suse.com>
> >[...]
> >> >@@ -3776,9 +3784,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >> > 
> >> > 	/*
> >> > 	 * Do not retry costly high order allocations unless they are
> >> >-	 * __GFP_REPEAT
> >> >+	 * __GFP_RETRY_MAYFAIL
> >> > 	 */
> >> >-	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
> >> >+	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_RETRY_MAYFAIL))
> >> > 		goto nopage;
> >> 
> >> One question:
> >> 
> >> From your change log, it mentions will provide the same semantic for !costly
> >> allocations. While the logic here is the same as before.
> >> 
> >> For a !costly allocation with __GFP_REPEAT flag, the difference after this
> >> patch is no OOM will be invoked, while it will still continue in the loop.
> >
> >Not really. There are two things. The above will shortcut retrying if
> >there is _no_ __GFP_RETRY_MAYFAIL. If the flags _is_ specified we will
> >back of in __alloc_pages_may_oom.
> > 
> >> Maybe I don't catch your point in this message:
> >> 
> >>   __GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to
> >>   the page allocator. This has been true but only for allocations requests
> >>   larger than PAGE_ALLOC_COSTLY_ORDER. It has been always ignored for
> >>   smaller sizes. This is a bit unfortunate because there is no way to
> >>   express the same semantic for those requests and they are considered too
> >>   important to fail so they might end up looping in the page allocator for
> >>   ever, similarly to GFP_NOFAIL requests.
> >> 
> >> I thought you will provide the same semantic to !costly allocation, or I
> >> misunderstand?
> >
> >yes and that is the case. __alloc_pages_may_oom will back off before OOM
> >killer is invoked and the allocator slow path will fail because
> >did_some_progress == 0;
> 
> Thanks for your explanation.
> 
> So same "semantic" doesn't mean same "behavior".
> 1. costly allocations will pick up the shut cut

yes and there are no such allocations yet (based on my previous
cleanups)

> 2. !costly allocations will try something more but finally fail without
> invoking OOM.

no, the behavior will not change for those.
 
> Hope this time I catch your point.
> 
> BTW, did_some_progress mostly means the OOM works to me. Are there some other
> important situations when did_some_progress is set to 1?

Yes e.g. for GFP_NOFS when we cannot really invoke the OOM killer yet we
cannot fail the allocation.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 2/4] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic
  2017-06-06 12:03           ` Michal Hocko
  (?)
@ 2017-06-07  2:10           ` Wei Yang
  2017-06-09  7:32               ` Michal Hocko
  -1 siblings, 1 reply; 47+ messages in thread
From: Wei Yang @ 2017-06-07  2:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Wei Yang, linux-mm, Vlastimil Babka, Johannes Weiner, Mel Gorman,
	Andrew Morton, LKML

[-- Attachment #1: Type: text/plain, Size: 5380 bytes --]

On Tue, Jun 06, 2017 at 02:03:15PM +0200, Michal Hocko wrote:
>On Tue 06-06-17 11:04:01, Wei Yang wrote:
>> On Mon, Jun 05, 2017 at 08:43:43AM +0200, Michal Hocko wrote:
>> >On Sat 03-06-17 10:24:40, Wei Yang wrote:
>> >> Hi, Michal
>> >> 
>> >> Just go through your patch.
>> >> 
>> >> I have one question and one suggestion as below.
>> >> 
>> >> One suggestion:
>> >> 
>> >> This patch does two things to me:
>> >> 1. Replace __GFP_REPEAT with __GFP_RETRY_MAYFAIL
>> >> 2. Adjust the logic in page_alloc to provide the middle semantic
>> >> 
>> >> My suggestion is to split these two task into two patches, so that readers
>> >> could catch your fundamental logic change easily.
>> >
>> >Well, the rename and the change is intentionally tight together. My
>> >previous patches have removed all __GFP_REPEAT users for low order
>> >requests which didn't have any implemented semantic. So as of now we
>> >should only have those users which semantic will not change. I do not
>> >add any new low order user in this patch so it in fact doesn't change
>> >any existing semnatic.
>> >
>> >> 
>> >> On Tue, Mar 07, 2017 at 04:48:41PM +0100, Michal Hocko wrote:
>> >> >From: Michal Hocko <mhocko@suse.com>
>> >[...]
>> >> >@@ -3776,9 +3784,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>> >> > 
>> >> > 	/*
>> >> > 	 * Do not retry costly high order allocations unless they are
>> >> >-	 * __GFP_REPEAT
>> >> >+	 * __GFP_RETRY_MAYFAIL
>> >> > 	 */
>> >> >-	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
>> >> >+	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_RETRY_MAYFAIL))
>> >> > 		goto nopage;
>> >> 
>> >> One question:
>> >> 
>> >> From your change log, it mentions will provide the same semantic for !costly
>> >> allocations. While the logic here is the same as before.
>> >> 
>> >> For a !costly allocation with __GFP_REPEAT flag, the difference after this
>> >> patch is no OOM will be invoked, while it will still continue in the loop.
>> >
>> >Not really. There are two things. The above will shortcut retrying if
>> >there is _no_ __GFP_RETRY_MAYFAIL. If the flags _is_ specified we will
>> >back of in __alloc_pages_may_oom.
>> > 
>> >> Maybe I don't catch your point in this message:
>> >> 
>> >>   __GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to
>> >>   the page allocator. This has been true but only for allocations requests
>> >>   larger than PAGE_ALLOC_COSTLY_ORDER. It has been always ignored for
>> >>   smaller sizes. This is a bit unfortunate because there is no way to
>> >>   express the same semantic for those requests and they are considered too
>> >>   important to fail so they might end up looping in the page allocator for
>> >>   ever, similarly to GFP_NOFAIL requests.
>> >> 
>> >> I thought you will provide the same semantic to !costly allocation, or I
>> >> misunderstand?
>> >
>> >yes and that is the case. __alloc_pages_may_oom will back off before OOM
>> >killer is invoked and the allocator slow path will fail because
>> >did_some_progress == 0;
>> 
>> Thanks for your explanation.
>> 
>> So same "semantic" doesn't mean same "behavior".
>> 1. costly allocations will pick up the shut cut
>
>yes and there are no such allocations yet (based on my previous
>cleanups)
>
>> 2. !costly allocations will try something more but finally fail without
>> invoking OOM.
>
>no, the behavior will not change for those.
> 

Hmm... Let me be more specific. With two factors, costly or not, flag set or
not, we have four combinations. Here it is classified into two categories.

1. __GFP_RETRY_MAYFAIL not set

Brief description on behavior:
    costly: pick up the shortcut, so no OOM
    !costly: no shortcut and will OOM I think

Impact from this patch set:
    No.
    
My personal understanding:
    The allocation without __GFP_RETRY_MAYFAIL is not effected by this patch
    set.  Since !costly allocation will trigger OOM, this is the reason why
    "small allocations never fail _practically_", as mentioned in
    https://lwn.net/Articles/723317/.


3. __GFP_RETRY_MAYFAIL set

Brief description on behavior:
    costly/!costly: no shortcut here and no OOM invoked

Impact from this patch set:
    For those allocations with __GFP_RETRY_MAYFAIL, OOM is not invoked for
    both.

My personal understanding:
    This is the semantic you are willing to introduce in this patch set. By
    cutting off the OOM invoke when __GFP_RETRY_MAYFAIL is set, you makes this
    a middle situation between NOFAIL and NORETRY.
    
    page_alloc will try some luck to get some free pages without disturb other
    part of the system. By doing so, the never fail allocation for !costly
    pages will be "fixed". If I understand correctly, you are willing to make
    this the default behavior in the future?

>> Hope this time I catch your point.
>> 
>> BTW, did_some_progress mostly means the OOM works to me. Are there some other
>> important situations when did_some_progress is set to 1?
>
>Yes e.g. for GFP_NOFS when we cannot really invoke the OOM killer yet we
>cannot fail the allocation.


Thanks, currently I have a clear concept on this, while I will remember this :)

>-- 
>Michal Hocko
>SUSE Labs

-- 
Wei Yang
Help you, Help me

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 2/4] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic
  2017-06-07  2:10           ` Wei Yang
@ 2017-06-09  7:32               ` Michal Hocko
  0 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2017-06-09  7:32 UTC (permalink / raw)
  To: Wei Yang
  Cc: linux-mm, Vlastimil Babka, Johannes Weiner, Mel Gorman,
	Andrew Morton, LKML

On Wed 07-06-17 10:10:36, Wei Yang wrote:
[...]
> Hmm... Let me be more specific. With two factors, costly or not, flag set or
> not, we have four combinations. Here it is classified into two categories.
> 
> 1. __GFP_RETRY_MAYFAIL not set
> 
> Brief description on behavior:
>     costly: pick up the shortcut, so no OOM
>     !costly: no shortcut and will OOM I think
> 
> Impact from this patch set:
>     No.

true

> My personal understanding:
>     The allocation without __GFP_RETRY_MAYFAIL is not effected by this patch
>     set.  Since !costly allocation will trigger OOM, this is the reason why
>     "small allocations never fail _practically_", as mentioned in
>     https://lwn.net/Articles/723317/.
> 
> 
> 3. __GFP_RETRY_MAYFAIL set
> 
> Brief description on behavior:
>     costly/!costly: no shortcut here and no OOM invoked
> 
> Impact from this patch set:
>     For those allocations with __GFP_RETRY_MAYFAIL, OOM is not invoked for
>     both.

yes

> My personal understanding:
>     This is the semantic you are willing to introduce in this patch set. By
>     cutting off the OOM invoke when __GFP_RETRY_MAYFAIL is set, you makes this
>     a middle situation between NOFAIL and NORETRY.

yes

>     page_alloc will try some luck to get some free pages without disturb other
>     part of the system. By doing so, the never fail allocation for !costly
>     pages will be "fixed". If I understand correctly, you are willing to make
>     this the default behavior in the future?

I do not think we can make this a default in a foreseeable future
unfortunately. That's why I've made it a gfp modifier in the first
place. I assume many users will opt in by using the flag. In future we
can even help by adding a highlevel GFP_$FOO flag but I am worried that
this would just add to the explosion of existing highlevel gfp masks
(e.g. do we want GFP_NOFS_MAY_FAIL, GFP_USER_MAY_FAIL,
GFP_USER_HIGH_MOVABLE_MAYFAIL etc...)
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 2/4] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic
@ 2017-06-09  7:32               ` Michal Hocko
  0 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2017-06-09  7:32 UTC (permalink / raw)
  To: Wei Yang
  Cc: linux-mm, Vlastimil Babka, Johannes Weiner, Mel Gorman,
	Andrew Morton, LKML

On Wed 07-06-17 10:10:36, Wei Yang wrote:
[...]
> Hmm... Let me be more specific. With two factors, costly or not, flag set or
> not, we have four combinations. Here it is classified into two categories.
> 
> 1. __GFP_RETRY_MAYFAIL not set
> 
> Brief description on behavior:
>     costly: pick up the shortcut, so no OOM
>     !costly: no shortcut and will OOM I think
> 
> Impact from this patch set:
>     No.

true

> My personal understanding:
>     The allocation without __GFP_RETRY_MAYFAIL is not effected by this patch
>     set.  Since !costly allocation will trigger OOM, this is the reason why
>     "small allocations never fail _practically_", as mentioned in
>     https://lwn.net/Articles/723317/.
> 
> 
> 3. __GFP_RETRY_MAYFAIL set
> 
> Brief description on behavior:
>     costly/!costly: no shortcut here and no OOM invoked
> 
> Impact from this patch set:
>     For those allocations with __GFP_RETRY_MAYFAIL, OOM is not invoked for
>     both.

yes

> My personal understanding:
>     This is the semantic you are willing to introduce in this patch set. By
>     cutting off the OOM invoke when __GFP_RETRY_MAYFAIL is set, you makes this
>     a middle situation between NOFAIL and NORETRY.

yes

>     page_alloc will try some luck to get some free pages without disturb other
>     part of the system. By doing so, the never fail allocation for !costly
>     pages will be "fixed". If I understand correctly, you are willing to make
>     this the default behavior in the future?

I do not think we can make this a default in a foreseeable future
unfortunately. That's why I've made it a gfp modifier in the first
place. I assume many users will opt in by using the flag. In future we
can even help by adding a highlevel GFP_$FOO flag but I am worried that
this would just add to the explosion of existing highlevel gfp masks
(e.g. do we want GFP_NOFS_MAY_FAIL, GFP_USER_MAY_FAIL,
GFP_USER_HIGH_MOVABLE_MAYFAIL etc...)
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2017-06-09  7:32 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-07 15:48 [RFC PATCH 0/4 v2] mm: give __GFP_REPEAT a better semantic Michal Hocko
2017-03-07 15:48 ` Michal Hocko
2017-03-07 15:48 ` [PATCH 1/4] s390: get rid of superfluous __GFP_REPEAT Michal Hocko
2017-03-07 15:48   ` Michal Hocko
2017-03-08  8:23   ` Heiko Carstens
2017-03-08  8:23     ` Heiko Carstens
2017-03-08 14:11     ` Michal Hocko
2017-03-08 14:11       ` Michal Hocko
2017-03-09  8:27       ` Heiko Carstens
2017-03-09  8:27         ` Heiko Carstens
2017-03-07 15:48 ` [RFC PATCH 2/4] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic Michal Hocko
2017-03-07 15:48   ` Michal Hocko
2017-05-25  1:21   ` NeilBrown
2017-05-31 11:42     ` Michal Hocko
2017-05-31 11:42       ` Michal Hocko
2017-06-03  2:24   ` Wei Yang
2017-06-05  6:43     ` Michal Hocko
2017-06-05  6:43       ` Michal Hocko
2017-06-06  3:04       ` Wei Yang
2017-06-06 12:03         ` Michal Hocko
2017-06-06 12:03           ` Michal Hocko
2017-06-07  2:10           ` Wei Yang
2017-06-09  7:32             ` Michal Hocko
2017-06-09  7:32               ` Michal Hocko
2017-03-07 15:48 ` [RFC PATCH 3/4] xfs: map KM_MAYFAIL to __GFP_RETRY_MAYFAIL Michal Hocko
2017-03-07 15:48   ` Michal Hocko
2017-03-07 17:05   ` Darrick J. Wong
2017-03-07 17:05     ` Darrick J. Wong
2017-03-08  9:35     ` Michal Hocko
2017-03-08  9:35       ` Michal Hocko
2017-03-08 11:23   ` Tetsuo Handa
2017-03-08 11:23     ` Tetsuo Handa
2017-03-08 12:54     ` Michal Hocko
2017-03-08 12:54       ` Michal Hocko
2017-03-08 15:06   ` Christoph Hellwig
2017-03-08 15:06     ` Christoph Hellwig
2017-03-09  9:16     ` Michal Hocko
2017-03-09  9:16       ` Michal Hocko
2017-03-07 15:48 ` [RFC PATCH 4/4] mm: kvmalloc support __GFP_RETRY_MAYFAIL for all sizes Michal Hocko
2017-03-07 15:48   ` Michal Hocko
2017-05-16  9:10 ` [RFC PATCH 0/4 v2] mm: give __GFP_REPEAT a better semantic Michal Hocko
2017-05-16  9:10   ` Michal Hocko
2017-05-23  8:12   ` Vlastimil Babka
2017-05-23  8:12     ` Vlastimil Babka
2017-05-24  1:06     ` NeilBrown
2017-05-24  7:34       ` Michal Hocko
2017-05-24  7:34         ` Michal Hocko

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.