linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 0/2] mm: give GFP_REPEAT a better semantic
@ 2016-06-06 11:32 Michal Hocko
  2016-06-06 11:32 ` [RFC PATCH 1/2] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_HARD with more useful semantic Michal Hocko
                   ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Michal Hocko @ 2016-06-06 11:32 UTC
  To: linux-mm
  Cc: Andrew Morton, Mel Gorman, Vlastimil Babka, Johannes Weiner,
	Rik van Riel, Dave Chinner, LKML

Hi,
this is a follow-up to the __GFP_REPEAT cleanup merged into mmotm just
recently [1]. The main motivation for the change is that the current
implementation of __GFP_REPEAT is not very useful.

The documentation says:
 * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
 *   _might_ fail.  This depends upon the particular VM implementation.

It just fails to mention that this is true only for large (costly) high-order
allocations, which has been the case since the flag was introduced. A similar
semantic would be really helpful for small orders as well, though, because we
have places where a failure with a specific fallback error handling is
preferred to a potential endless loop inside the page allocator.
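
To illustrate the intended usage pattern (just a sketch, not code from this
series), a !costly allocation with a fallback could then look like:

	/*
	 * Hypothetical caller: prefer a physically contiguous buffer but
	 * fall back to vmalloc rather than looping in the page allocator.
	 */
	void *buf = kzalloc(size, GFP_KERNEL | __GFP_RETRY_HARD | __GFP_NOWARN);
	if (!buf)
		buf = vzalloc(size);
	if (!buf)
		return -ENOMEM;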

The cleanup [1] dropped __GFP_REPEAT usage for low (!costly) order users,
so only those which might use larger orders have stayed. Let's rename the
flag to something more verbose and use it for the existing users; the
semantic for those will not change. Then implement a failure path for low
(!costly) orders which is hit once the page allocator is about to invoke
the OOM killer again. That means that the first OOM killer invocation and
all the retries since then haven't helped us to move on. This seems like a
good indication that any further progress is highly unlikely.

XFS already has an annotation for allocations which are allowed to fail
and we can trivially map it to the new gfp flag because the flag provides
exactly the semantic KM_MAYFAIL wants.

I assume we will grow more users - e.g. GFP_USER sounds like it could
use the flag by default. But I haven't explored this path properly yet.

I am sending this as an RFC and would like to hear back about the
approach. We have discussed this at LSF this year and there were
different ideas about how to achieve the semantic. I have decided to go
the __GFP_RETRY_HARD way because it fits nicely into the existing scheme
where __GFP_NORETRY and __GFP_NOFAIL already modify the default behavior
of the page allocator, and the new flag sits naturally between the two
existing flags. Patch 1 is much more verbose about the different modes
of operation of the page allocator.

Thanks
---
[1] http://lkml.kernel.org/r/1464599699-30131-1-git-send-email-mhocko@kernel.org


* [RFC PATCH 1/2] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_HARD with more useful semantic
  2016-06-06 11:32 [RFC PATCH 0/2] mm: give GFP_REPEAT a better semantic Michal Hocko
@ 2016-06-06 11:32 ` Michal Hocko
  2016-06-07 12:11   ` Tetsuo Handa
  2016-06-06 11:32 ` [RFC PATCH 2/2] xfs: map KM_MAYFAIL to __GFP_RETRY_HARD Michal Hocko
  2016-10-06 11:14 ` [RFC PATCH 0/2] mm: give GFP_REPEAT a better semantic Michal Hocko
  2 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2016-06-06 11:32 UTC
  To: linux-mm
  Cc: Andrew Morton, Mel Gorman, Vlastimil Babka, Johannes Weiner,
	Rik van Riel, Dave Chinner, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

__GFP_REPEAT was designed to add a retry-but-eventually-fail semantic to
the page allocator. This has been true, but only for allocation requests
larger than PAGE_ALLOC_COSTLY_ORDER; it has always been ignored for
smaller sizes. This is a bit unfortunate because there is no way to
express the same semantic for those requests, and since they are
considered too important to fail they might end up looping in the page
allocator forever, similarly to __GFP_NOFAIL requests.

Now that the whole tree has been cleaned up and accidental or misguided
usage of the __GFP_REPEAT flag has been removed for !costly requests, we
can give the original flag a better name and, more importantly, a more
useful semantic. Let's rename it to __GFP_RETRY_HARD, which tells the user
that the allocator will try really hard but there is no promise of
success. This works independently of the order and overrides the default
allocator behavior. Page allocator users have several levels of guarantee
vs. cost options (take GFP_KERNEL as an example):
- GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
  attempt to free memory at all. The most lightweight mode, which doesn't
  even kick the background reclaim. Should be used carefully because it
  might deplete the memory and the next user might hit the more
  aggressive reclaim
- GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT) - optimistic
  allocation without any attempt to free memory from the current context,
  but can wake kswapd to reclaim memory if the zone is below the low
  watermark. Can be used from either atomic contexts or when the request
  is a performance optimization and there is another fallback for a slow
  path.
- (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
  non-sleeping allocation which can access some portion of memory
  reserves. Usually used from interrupt/bh context with an expensive
  slow-path fallback.
- GFP_KERNEL - both background and direct reclaim are allowed and the
  _default_ page allocator behavior is used. That means that !costly
  allocation requests are basically nofail (unless the requesting task
  is killed by the OOM killer) and costly ones will fail early rather
  than cause disruptive reclaim.
- GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior and
  all allocation requests fail early rather than cause disruptive
  reclaim (one round of reclaim in this implementation). No OOM killer
  is invoked.
- GFP_KERNEL | __GFP_RETRY_HARD - overrides the default allocator behavior
  and all allocation requests try really hard; !costly ones are allowed to
  invoke the OOM killer. The request will fail if no progress is expected
  (see the short example after this list).
- GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
  and all allocation requests will loop endlessly until they
  succeed. This might be really dangerous especially for larger orders.
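
For illustration (a sketch, not part of the diff below), the same request
under the three modifiers:

	/* fail after one round of reclaim/compaction, never invoke the OOM killer */
	page = alloc_pages(GFP_KERNEL | __GFP_NORETRY, order);

	/* try hard, possibly invoke the OOM killer for !costly orders, but
	 * eventually return NULL */
	page = alloc_pages(GFP_KERNEL | __GFP_RETRY_HARD, order);

	/* loop inside the allocator until success, never return NULL */
	page = alloc_pages(GFP_KERNEL | __GFP_NOFAIL, order);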

Existing users of __GFP_REPEAT are changed to __GFP_RETRY_HARD because
this is the semantic they have relied on. No new users are added.
__alloc_pages_slowpath is changed to bail out for __GFP_RETRY_HARD if
there is no progress and we have already passed the OOM point. This
means that all the reclaim opportunities have been exhausted and further
retrying most probably doesn't make much sense.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 Documentation/DMA-ISA-LPC.txt                |  2 +-
 arch/powerpc/include/asm/book3s/64/pgalloc.h |  2 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c          |  2 +-
 drivers/block/xen-blkfront.c                 |  2 +-
 drivers/mmc/host/wbsd.c                      |  2 +-
 drivers/s390/char/vmcp.c                     |  2 +-
 drivers/target/target_core_transport.c       |  2 +-
 drivers/vhost/net.c                          |  2 +-
 drivers/vhost/scsi.c                         |  2 +-
 drivers/vhost/vhost.c                        |  2 +-
 fs/btrfs/check-integrity.c                   |  2 +-
 fs/btrfs/raid56.c                            |  2 +-
 include/linux/gfp.h                          | 32 +++++++++++++++++++---------
 include/linux/slab.h                         |  3 ++-
 include/trace/events/mmflags.h               |  2 +-
 mm/huge_memory.c                             |  2 +-
 mm/hugetlb.c                                 |  4 ++--
 mm/internal.h                                |  2 +-
 mm/page_alloc.c                              | 19 ++++++++++++++---
 mm/sparse-vmemmap.c                          |  4 ++--
 mm/vmscan.c                                  |  8 +++----
 net/core/dev.c                               |  6 +++---
 net/core/skbuff.c                            |  2 +-
 net/sched/sch_fq.c                           |  2 +-
 tools/perf/builtin-kmem.c                    |  2 +-
 25 files changed, 69 insertions(+), 43 deletions(-)

diff --git a/Documentation/DMA-ISA-LPC.txt b/Documentation/DMA-ISA-LPC.txt
index b1a19835e907..5b594dfb1783 100644
--- a/Documentation/DMA-ISA-LPC.txt
+++ b/Documentation/DMA-ISA-LPC.txt
@@ -42,7 +42,7 @@ requirements you pass the flag GFP_DMA to kmalloc.
 
 Unfortunately the memory available for ISA DMA is scarce so unless you
 allocate the memory during boot-up it's a good idea to also pass
-__GFP_REPEAT and __GFP_NOWARN to make the allocater try a bit harder.
+__GFP_RETRY_HARD and __GFP_NOWARN to make the allocater try a bit harder.
 
 (This scarcity also means that you should allocate the buffer as
 early as possible and not release it until the driver is unloaded.)
diff --git a/arch/powerpc/include/asm/book3s/64/pgalloc.h b/arch/powerpc/include/asm/book3s/64/pgalloc.h
index d14fcf82c00c..be3b996915a9 100644
--- a/arch/powerpc/include/asm/book3s/64/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/64/pgalloc.h
@@ -56,7 +56,7 @@ static inline pgd_t *radix__pgd_alloc(struct mm_struct *mm)
 	return (pgd_t *)__get_free_page(PGALLOC_GFP);
 #else
 	struct page *page;
-	page = alloc_pages(PGALLOC_GFP | __GFP_REPEAT, 4);
+	page = alloc_pages(PGALLOC_GFP | __GFP_RETRY_HARD, 4);
 	if (!page)
 		return NULL;
 	return (pgd_t *) page_address(page);
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 05f09ae82587..204484fbda51 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -72,7 +72,7 @@ long kvmppc_alloc_hpt(struct kvm *kvm, u32 *htab_orderp)
 	/* Lastly try successively smaller sizes from the page allocator */
 	/* Only do this if userspace didn't specify a size via ioctl */
 	while (!hpt && order > PPC_MIN_HPT_ORDER && !htab_orderp) {
-		hpt = __get_free_pages(GFP_KERNEL|__GFP_ZERO|__GFP_REPEAT|
+		hpt = __get_free_pages(GFP_KERNEL|__GFP_ZERO|__GFP_RETRY_HARD|
 				       __GFP_NOWARN, order - PAGE_SHIFT);
 		if (!hpt)
 			--order;
diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index ca13df854639..2633e1c32a45 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -2029,7 +2029,7 @@ static int blkif_recover(struct blkfront_info *info)
 		rinfo = &info->rinfo[r_index];
 		/* Stage 1: Make a safe copy of the shadow state. */
 		copy = kmemdup(rinfo->shadow, sizeof(rinfo->shadow),
-			       GFP_NOIO | __GFP_REPEAT | __GFP_HIGH);
+			       GFP_NOIO | __GFP_RETRY_HARD | __GFP_HIGH);
 		if (!copy)
 			return -ENOMEM;
 
diff --git a/drivers/mmc/host/wbsd.c b/drivers/mmc/host/wbsd.c
index c3fd16d997ca..cb71a383c4ec 100644
--- a/drivers/mmc/host/wbsd.c
+++ b/drivers/mmc/host/wbsd.c
@@ -1386,7 +1386,7 @@ static void wbsd_request_dma(struct wbsd_host *host, int dma)
 	 * order for ISA to be able to DMA to it.
 	 */
 	host->dma_buffer = kmalloc(WBSD_DMA_SIZE,
-		GFP_NOIO | GFP_DMA | __GFP_REPEAT | __GFP_NOWARN);
+		GFP_NOIO | GFP_DMA | __GFP_RETRY_HARD | __GFP_NOWARN);
 	if (!host->dma_buffer)
 		goto free;
 
diff --git a/drivers/s390/char/vmcp.c b/drivers/s390/char/vmcp.c
index 2a67b496a9e2..d5ecf3007ac6 100644
--- a/drivers/s390/char/vmcp.c
+++ b/drivers/s390/char/vmcp.c
@@ -98,7 +98,7 @@ vmcp_write(struct file *file, const char __user *buff, size_t count,
 	}
 	if (!session->response)
 		session->response = (char *)__get_free_pages(GFP_KERNEL
-						| __GFP_REPEAT | GFP_DMA,
+						| __GFP_RETRY_HARD | GFP_DMA,
 						get_order(session->bufsize));
 	if (!session->response) {
 		mutex_unlock(&session->mutex);
diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c
index 5ab3967dda43..6102177b2c7b 100644
--- a/drivers/target/target_core_transport.c
+++ b/drivers/target/target_core_transport.c
@@ -251,7 +251,7 @@ int transport_alloc_session_tags(struct se_session *se_sess,
 	int rc;
 
 	se_sess->sess_cmd_map = kzalloc(tag_num * tag_size,
-					GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+					GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_HARD);
 	if (!se_sess->sess_cmd_map) {
 		se_sess->sess_cmd_map = vzalloc(tag_num * tag_size);
 		if (!se_sess->sess_cmd_map) {
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index f744eeb3e2b4..820a4715eb38 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -747,7 +747,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 	struct vhost_virtqueue **vqs;
 	int i;
 
-	n = kmalloc(sizeof *n, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+	n = kmalloc(sizeof *n, GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_HARD);
 	if (!n) {
 		n = vmalloc(sizeof *n);
 		if (!n)
diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 9d6320e8ff3e..7150b45e092f 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1405,7 +1405,7 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
 	struct vhost_virtqueue **vqs;
 	int r = -ENOMEM, i;
 
-	vs = kzalloc(sizeof(*vs), GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+	vs = kzalloc(sizeof(*vs), GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_HARD);
 	if (!vs) {
 		vs = vzalloc(sizeof(*vs));
 		if (!vs)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 669fef1e2bb6..a4b0f18a69ab 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -707,7 +707,7 @@ static int vhost_memory_reg_sort_cmp(const void *p1, const void *p2)
 
 static void *vhost_kvzalloc(unsigned long size)
 {
-	void *n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+	void *n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_HARD);
 
 	if (!n)
 		n = vzalloc(size);
diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index b677a6ea6001..f8658fbbfa60 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -3049,7 +3049,7 @@ int btrfsic_mount(struct btrfs_root *root,
 		       root->sectorsize, PAGE_SIZE);
 		return -1;
 	}
-	state = kzalloc(sizeof(*state), GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+	state = kzalloc(sizeof(*state), GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_HARD);
 	if (!state) {
 		state = vzalloc(sizeof(*state));
 		if (!state) {
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index f8b6d411a034..db2bba5cb90a 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -218,7 +218,7 @@ int btrfs_alloc_stripe_hash_table(struct btrfs_fs_info *info)
 	 * of a failing mount.
 	 */
 	table_size = sizeof(*table) + sizeof(*h) * num_entries;
-	table = kzalloc(table_size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+	table = kzalloc(table_size, GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_HARD);
 	if (!table) {
 		table = vzalloc(table_size);
 		if (!table)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index c29e9d347bc6..9961086eac2e 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -25,7 +25,7 @@ struct vm_area_struct;
 #define ___GFP_FS		0x80u
 #define ___GFP_COLD		0x100u
 #define ___GFP_NOWARN		0x200u
-#define ___GFP_REPEAT		0x400u
+#define ___GFP_RETRY_HARD		0x400u
 #define ___GFP_NOFAIL		0x800u
 #define ___GFP_NORETRY		0x1000u
 #define ___GFP_MEMALLOC		0x2000u
@@ -132,26 +132,38 @@ struct vm_area_struct;
  *
  * __GFP_RECLAIM is shorthand to allow/forbid both direct and kswapd reclaim.
  *
- * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
- *   _might_ fail.  This depends upon the particular VM implementation.
+ * The default allocator behavior depends on the request size. We have a concept
+ * of so-called costly allocations (with order > PAGE_ALLOC_COSTLY_ORDER).
+ * !costly allocations are too essential to fail so they are implicitly
+ * non-failing by default (with some exceptions, e.g. OOM victims might fail),
+ * while costly requests try not to be disruptive and back off even without
+ * invoking the OOM killer. The following three modifiers can be used to
+ * override some of these implicit rules:
+ *
+ * __GFP_NORETRY: The VM implementation must not retry indefinitely and will
+ *   return NULL when direct reclaim and memory compaction have failed to allow
+ *   the allocation to succeed.  The OOM killer is not called with the current
+ *   implementation. This is a default mode for costly allocations.
+ *
+ * __GFP_RETRY_HARD: Try hard to allocate the memory, but the allocation attempt
+ *   _might_ fail. All viable forms of memory reclaim are tried before failing,
+ *   including the OOM killer for !costly allocations. This can be used to
+ *   override the non-failing default behavior for !costly requests as well as
+ *   to fortify costly requests.
  *
  * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
  *   cannot handle allocation failures. New users should be evaluated carefully
  *   (and the flag should be used only when there is no reasonable failure
  *   policy) but it is definitely preferable to use the flag rather than
- *   opencode endless loop around allocator.
- *
- * __GFP_NORETRY: The VM implementation must not retry indefinitely and will
- *   return NULL when direct reclaim and memory compaction have failed to allow
- *   the allocation to succeed.  The OOM killer is not called with the current
- *   implementation.
+ *   opencode endless loop around allocator. Using this flag for costly allocations
+ *   is _highly_ discouraged.
  */
 #define __GFP_IO	((__force gfp_t)___GFP_IO)
 #define __GFP_FS	((__force gfp_t)___GFP_FS)
 #define __GFP_DIRECT_RECLAIM	((__force gfp_t)___GFP_DIRECT_RECLAIM) /* Caller can reclaim */
 #define __GFP_KSWAPD_RECLAIM	((__force gfp_t)___GFP_KSWAPD_RECLAIM) /* kswapd can wake */
 #define __GFP_RECLAIM ((__force gfp_t)(___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM))
-#define __GFP_REPEAT	((__force gfp_t)___GFP_REPEAT)
+#define __GFP_RETRY_HARD	((__force gfp_t)___GFP_RETRY_HARD)
 #define __GFP_NOFAIL	((__force gfp_t)___GFP_NOFAIL)
 #define __GFP_NORETRY	((__force gfp_t)___GFP_NORETRY)
 
diff --git a/include/linux/slab.h b/include/linux/slab.h
index aeb3e6d00a66..cacd437fdbf4 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -457,7 +457,8 @@ static __always_inline void *kmalloc_large(size_t size, gfp_t flags)
  *
  * %__GFP_NOWARN - If allocation fails, don't issue any warnings.
  *
- * %__GFP_REPEAT - If allocation fails initially, try once more before failing.
+ * %__GFP_RETRY_HARD - Try really hard to satisfy the allocation but fail
+ *   eventually.
  *
  * There are other flags available as well, but these are not intended
  * for general use, and so are not documented here. For a full list of
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 43cedbf0c759..c5f488767860 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -30,7 +30,7 @@
 	{(unsigned long)__GFP_FS,		"__GFP_FS"},		\
 	{(unsigned long)__GFP_COLD,		"__GFP_COLD"},		\
 	{(unsigned long)__GFP_NOWARN,		"__GFP_NOWARN"},	\
-	{(unsigned long)__GFP_REPEAT,		"__GFP_REPEAT"},	\
+	{(unsigned long)__GFP_RETRY_HARD,	"__GFP_RETRY_HARD"},	\
 	{(unsigned long)__GFP_NOFAIL,		"__GFP_NOFAIL"},	\
 	{(unsigned long)__GFP_NORETRY,		"__GFP_NORETRY"},	\
 	{(unsigned long)__GFP_COMP,		"__GFP_COMP"},		\
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index acd374e200cf..69872951f653 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -378,7 +378,7 @@ static ssize_t single_flag_store(struct kobject *kobj,
 
 /*
  * Currently defrag only disables __GFP_NOWAIT for allocation. A blind
- * __GFP_REPEAT is too aggressive, it's never worth swapping tons of
+ * __GFP_RETRY_HARD is too aggressive, it's never worth swapping tons of
  * memory just to allocate one more hugepage.
  */
 static ssize_t defrag_show(struct kobject *kobj,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e197cd7080e6..62306b5a302a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1363,7 +1363,7 @@ static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
 
 	page = __alloc_pages_node(nid,
 		htlb_alloc_mask(h)|__GFP_COMP|__GFP_THISNODE|
-						__GFP_REPEAT|__GFP_NOWARN,
+						__GFP_RETRY_HARD|__GFP_NOWARN,
 		huge_page_order(h));
 	if (page) {
 		prep_new_huge_page(h, page, nid);
@@ -1480,7 +1480,7 @@ static struct page *__hugetlb_alloc_buddy_huge_page(struct hstate *h,
 		struct vm_area_struct *vma, unsigned long addr, int nid)
 {
 	int order = huge_page_order(h);
-	gfp_t gfp = htlb_alloc_mask(h)|__GFP_COMP|__GFP_REPEAT|__GFP_NOWARN;
+	gfp_t gfp = htlb_alloc_mask(h)|__GFP_COMP|__GFP_RETRY_HARD|__GFP_NOWARN;
 	unsigned int cpuset_mems_cookie;
 
 	/*
diff --git a/mm/internal.h b/mm/internal.h
index 420bbe300bcd..083c87c539b6 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -23,7 +23,7 @@
  * hints such as HIGHMEM usage.
  */
 #define GFP_RECLAIM_MASK (__GFP_RECLAIM|__GFP_HIGH|__GFP_IO|__GFP_FS|\
-			__GFP_NOWARN|__GFP_REPEAT|__GFP_NOFAIL|\
+			__GFP_NOWARN|__GFP_RETRY_HARD|__GFP_NOFAIL|\
 			__GFP_NORETRY|__GFP_MEMALLOC|__GFP_NOMEMALLOC)
 
 /* The GFP flags allowed during early boot */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 180f5afc5a1f..faa3d4a27850 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3262,7 +3262,7 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
 		return compaction_zonelist_suitable(ac, order, alloc_flags);
 
 	/*
-	 * !costly requests are much more important than __GFP_REPEAT
+	 * !costly requests are much more important than __GFP_RETRY_HARD
 	 * costly ones because they are de facto nofail and invoke OOM
 	 * killer to move on while costly can fail and users are ready
 	 * to cope with that. 1/4 retries is rather arbitrary but we
@@ -3550,6 +3550,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	enum compact_result compact_result;
 	int compaction_retries = 0;
 	int no_progress_loops = 0;
+	bool passed_oom = false;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -3680,9 +3681,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 
 	/*
 	 * Do not retry costly high order allocations unless they are
-	 * __GFP_REPEAT
+	 * __GFP_RETRY_HARD
 	 */
-	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
+	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_RETRY_HARD))
 		goto noretry;
 
 	/*
@@ -3711,6 +3712,17 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 				compaction_retries))
 		goto retry;
 
+	/*
+	 * We have already exhausted all our reclaim opportunities including
+	 * the OOM killer without any success so it is time to admit defeat.
+	 * We do not care about the order because we want all orders to behave
+	 * consistently including !costly ones. costly are handled in
+	 * __alloc_pages_may_oom and will bail out even before the first OOM
+	 * killer invocation
+	 */
+	if (passed_oom && (gfp_mask & __GFP_RETRY_HARD))
+		goto nopage;
+
 	/* Reclaim has failed us, start killing things */
 	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
 	if (page)
@@ -3719,6 +3731,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	/* Retry as long as the OOM killer is making progress */
 	if (did_some_progress) {
 		no_progress_loops = 0;
+		passed_oom = true;
 		goto retry;
 	}
 
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 68885dcbaf40..261facd8e1c8 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -56,11 +56,11 @@ void * __meminit vmemmap_alloc_block(unsigned long size, int node)
 
 		if (node_state(node, N_HIGH_MEMORY))
 			page = alloc_pages_node(
-				node, GFP_KERNEL | __GFP_ZERO | __GFP_REPEAT,
+				node, GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_HARD,
 				get_order(size));
 		else
 			page = alloc_pages(
-				GFP_KERNEL | __GFP_ZERO | __GFP_REPEAT,
+				GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_HARD,
 				get_order(size));
 		if (page)
 			return page_address(page);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 93ba33789ac6..ff21efe06430 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2319,18 +2319,18 @@ static inline bool should_continue_reclaim(struct zone *zone,
 		return false;
 
 	/* Consider stopping depending on scan and reclaim activity */
-	if (sc->gfp_mask & __GFP_REPEAT) {
+	if (sc->gfp_mask & __GFP_RETRY_HARD) {
 		/*
-		 * For __GFP_REPEAT allocations, stop reclaiming if the
+		 * For __GFP_RETRY_HARD allocations, stop reclaiming if the
 		 * full LRU list has been scanned and we are still failing
 		 * to reclaim pages. This full LRU scan is potentially
-		 * expensive but a __GFP_REPEAT caller really wants to succeed
+		 * expensive but a __GFP_RETRY_HARD caller really wants to succeed
 		 */
 		if (!nr_reclaimed && !nr_scanned)
 			return false;
 	} else {
 		/*
-		 * For non-__GFP_REPEAT allocations which can presumably
+		 * For non-__GFP_RETRY_HARD allocations which can presumably
 		 * fail without consequence, stop if we failed to reclaim
 		 * any pages from the last SWAP_CLUSTER_MAX number of
 		 * pages that were scanned. This will return to the
diff --git a/net/core/dev.c b/net/core/dev.c
index 904ff431d570..8a916dd1d833 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6897,7 +6897,7 @@ static int netif_alloc_rx_queues(struct net_device *dev)
 
 	BUG_ON(count < 1);
 
-	rx = kzalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+	rx = kzalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_HARD);
 	if (!rx) {
 		rx = vzalloc(sz);
 		if (!rx)
@@ -6939,7 +6939,7 @@ static int netif_alloc_netdev_queues(struct net_device *dev)
 	if (count < 1 || count > 0xffff)
 		return -EINVAL;
 
-	tx = kzalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+	tx = kzalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_HARD);
 	if (!tx) {
 		tx = vzalloc(sz);
 		if (!tx)
@@ -7477,7 +7477,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	/* ensure 32-byte alignment of whole construct */
 	alloc_size += NETDEV_ALIGN - 1;
 
-	p = kzalloc(alloc_size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+	p = kzalloc(alloc_size, GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_HARD);
 	if (!p)
 		p = vzalloc(alloc_size);
 	if (!p)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index e7ec6d3ad5f0..c4c667a7b39b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4621,7 +4621,7 @@ struct sk_buff *alloc_skb_with_frags(unsigned long header_len,
 
 	gfp_head = gfp_mask;
 	if (gfp_head & __GFP_DIRECT_RECLAIM)
-		gfp_head |= __GFP_REPEAT;
+		gfp_head |= __GFP_RETRY_HARD;
 
 	*errcode = -ENOBUFS;
 	skb = alloc_skb(header_len, gfp_head);
diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index 3c6a47d66a04..7ed2224df628 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -599,7 +599,7 @@ static void *fq_alloc_node(size_t sz, int node)
 {
 	void *ptr;
 
-	ptr = kmalloc_node(sz, GFP_KERNEL | __GFP_REPEAT | __GFP_NOWARN, node);
+	ptr = kmalloc_node(sz, GFP_KERNEL | __GFP_RETRY_HARD | __GFP_NOWARN, node);
 	if (!ptr)
 		ptr = vmalloc_node(sz, node);
 	return ptr;
diff --git a/tools/perf/builtin-kmem.c b/tools/perf/builtin-kmem.c
index 58adfee230de..a63f8f4e6a6a 100644
--- a/tools/perf/builtin-kmem.c
+++ b/tools/perf/builtin-kmem.c
@@ -627,7 +627,7 @@ static const struct {
 	{ "__GFP_FS",			"F" },
 	{ "__GFP_COLD",			"CO" },
 	{ "__GFP_NOWARN",		"NWR" },
-	{ "__GFP_REPEAT",		"R" },
+	{ "__GFP_RETRY_HARD",		"R" },
 	{ "__GFP_NOFAIL",		"NF" },
 	{ "__GFP_NORETRY",		"NR" },
 	{ "__GFP_COMP",			"C" },
-- 
2.8.1


* [RFC PATCH 2/2] xfs: map KM_MAYFAIL to __GFP_RETRY_HARD
  2016-06-06 11:32 [RFC PATCH 0/2] mm: give GFP_REPEAT a better semantic Michal Hocko
  2016-06-06 11:32 ` [RFC PATCH 1/2] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_HARD with more useful semantic Michal Hocko
@ 2016-06-06 11:32 ` Michal Hocko
  2016-06-16  0:23   ` Dave Chinner
  2016-10-06 11:14 ` [RFC PATCH 0/2] mm: give GFP_REPEAT a better semantic Michal Hocko
  2 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2016-06-06 11:32 UTC
  To: linux-mm
  Cc: Andrew Morton, Mel Gorman, Vlastimil Babka, Johannes Weiner,
	Rik van Riel, Dave Chinner, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

KM_MAYFAIL didn't have any suitable GFP_FOO counterpart until recently,
so it relied on the default page allocator behavior for the given set
of flags. This means that small allocations actually never failed.

Now that we have the __GFP_RETRY_HARD flag, which works independently of
the allocation request size, we can map KM_MAYFAIL to it. The allocator
will try as hard as it can to fulfill the request but will fail
eventually if no progress can be made.
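
For example (a hypothetical caller, not part of this patch), an XFS
allocation site using KM_MAYFAIL is already expected to handle NULL, and
with this patch its error path is taken instead of looping in the
allocator:

	/*
	 * kmem_flags_convert() now adds __GFP_RETRY_HARD for KM_MAYFAIL,
	 * so this allocation may eventually fail.
	 */
	ptr = kmem_zalloc(sizeof(*ptr), KM_MAYFAIL);
	if (!ptr)
		return -ENOMEM;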

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/xfs/kmem.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index 689f746224e7..34e6b062ce0e 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -54,6 +54,9 @@ kmem_flags_convert(xfs_km_flags_t flags)
 			lflags &= ~__GFP_FS;
 	}
 
+	if (flags & KM_MAYFAIL)
+		lflags |= __GFP_RETRY_HARD;
+
 	if (flags & KM_ZERO)
 		lflags |= __GFP_ZERO;
 
-- 
2.8.1


* Re: [RFC PATCH 1/2] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_HARD with more useful semantic
  2016-06-06 11:32 ` [RFC PATCH 1/2] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_HARD with more useful semantic Michal Hocko
@ 2016-06-07 12:11   ` Tetsuo Handa
  2016-06-07 12:31     ` Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread
From: Tetsuo Handa @ 2016-06-07 12:11 UTC
  To: Michal Hocko, linux-mm
  Cc: Andrew Morton, Mel Gorman, Vlastimil Babka, Johannes Weiner,
	Rik van Riel, Dave Chinner, LKML, Michal Hocko

On 2016/06/06 20:32, Michal Hocko wrote:
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 669fef1e2bb6..a4b0f18a69ab 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -707,7 +707,7 @@ static int vhost_memory_reg_sort_cmp(const void *p1, const void *p2)
>  
>  static void *vhost_kvzalloc(unsigned long size)
>  {
> -	void *n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
> +	void *n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_HARD);

Remaining __GFP_REPEAT users are not always doing costly allocations.
Sometimes they pass __GFP_REPEAT because the size is given from userspace.
Thus, unconditional s/__GFP_REPEAT/__GFP_RETRY_HARD/g is not good.
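
For illustration (a hypothetical helper, not in the kernel): whether a
request is costly depends on the order computed from the size, so a
userspace-supplied size can land on either side of the boundary and the
same call site gets different retry behavior:

	/*
	 * With 4KB pages PAGE_ALLOC_COSTLY_ORDER is 3, so anything larger
	 * than 8 pages (32KB) is treated as a costly request.
	 */
	static bool alloc_is_costly(size_t size)
	{
		return get_order(size) > PAGE_ALLOC_COSTLY_ORDER;
	}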

What I think more important is hearing from __GFP_REPEAT users how hard they
want to retry. It is possible that they want to retry unless SIGKILL is
delivered, but passing __GFP_NOFAIL is too hard, and therefore __GFP_REPEAT
is used instead. It is possible that they use __GFP_NOFAIL || __GFP_KILLABLE
if __GFP_KILLABLE were available. In my module (though I'm not using
__GFP_REPEAT), I want to retry unless SIGKILL is delivered.

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 180f5afc5a1f..faa3d4a27850 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3262,7 +3262,7 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
>  		return compaction_zonelist_suitable(ac, order, alloc_flags);
>  
>  	/*
> -	 * !costly requests are much more important than __GFP_REPEAT
> +	 * !costly requests are much more important than __GFP_RETRY_HARD
>  	 * costly ones because they are de facto nofail and invoke OOM
>  	 * killer to move on while costly can fail and users are ready
>  	 * to cope with that. 1/4 retries is rather arbitrary but we
> @@ -3550,6 +3550,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	enum compact_result compact_result;
>  	int compaction_retries = 0;
>  	int no_progress_loops = 0;
> +	bool passed_oom = false;
>  
>  	/*
>  	 * In the slowpath, we sanity check order to avoid ever trying to
> @@ -3680,9 +3681,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  
>  	/*
>  	 * Do not retry costly high order allocations unless they are
> -	 * __GFP_REPEAT
> +	 * __GFP_RETRY_HARD
>  	 */
> -	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
> +	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_RETRY_HARD))
>  		goto noretry;
>  
>  	/*
> @@ -3711,6 +3712,17 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  				compaction_retries))
>  		goto retry;
>  
> +	/*
> +	 * We have already exhausted all our reclaim opportunities including
> +	 * the OOM killer without any success so it is time to admit defeat.
> +	 * We do not care about the order because we want all orders to behave
> +	 * consistently including !costly ones. costly are handled in
> +	 * __alloc_pages_may_oom and will bail out even before the first OOM
> +	 * killer invocation
> +	 */
> +	if (passed_oom && (gfp_mask & __GFP_RETRY_HARD))
> +		goto nopage;
> +

If __GFP_REPEAT was passed because the size is not known at compile time, this
will break "!costly allocations will retry unless TIF_MEMDIE is set" behavior.

>  	/* Reclaim has failed us, start killing things */
>  	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
>  	if (page)
> @@ -3719,6 +3731,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	/* Retry as long as the OOM killer is making progress */
>  	if (did_some_progress) {
>  		no_progress_loops = 0;
> +		passed_oom = true;

This is too premature. did_some_progress != 0 after returning from
__alloc_pages_may_oom() does not mean the OOM killer was invoked. It only means
that mutex_trylock(&oom_lock) was attempted. It is possible that somebody else
is on the way to call out_of_memory(). It is possible that the OOM reaper is
about to start reaping memory. Giving up after 1 jiffie of sleep is too fast.
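
For reference, a simplified excerpt (paraphrased) of __alloc_pages_may_oom()
from this era's mm/page_alloc.c: when somebody else already holds oom_lock,
the function sleeps for one jiffy and reports "progress" without invoking
the OOM killer itself:

	if (!mutex_trylock(&oom_lock)) {
		*did_some_progress = 1;
		schedule_timeout_uninterruptible(1);
		return NULL;
	}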

>  		goto retry;
>  	}
>  


* Re: [RFC PATCH 1/2] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_HARD with more useful semantic
  2016-06-07 12:11   ` Tetsuo Handa
@ 2016-06-07 12:31     ` Michal Hocko
  2016-06-11 14:35       ` Tetsuo Handa
  0 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2016-06-07 12:31 UTC
  To: Tetsuo Handa
  Cc: linux-mm, Andrew Morton, Mel Gorman, Vlastimil Babka,
	Johannes Weiner, Rik van Riel, Dave Chinner, LKML

On Tue 07-06-16 21:11:03, Tetsuo Handa wrote:
> On 2016/06/06 20:32, Michal Hocko wrote:
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index 669fef1e2bb6..a4b0f18a69ab 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -707,7 +707,7 @@ static int vhost_memory_reg_sort_cmp(const void *p1, const void *p2)
> >  
> >  static void *vhost_kvzalloc(unsigned long size)
> >  {
> > -	void *n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
> > +	void *n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_HARD);
> 
> Remaining __GFP_REPEAT users are not always doing costly allocations.

Yes but...

> Sometimes they pass __GFP_REPEAT because the size is given from userspace.
> Thus, unconditional s/__GFP_REPEAT/__GFP_RETRY_HARD/g is not good.

Would that be a regression though? Strictly speaking the __GFP_REPEAT
documentation was explicit to not loop for ever. So nobody should have
expected nofail semantic pretty much by definition. The fact that our
previous implementation was not fully conforming to the documentation is
just an implementation detail.  All the remaining users of __GFP_REPEAT
_have_ to be prepared for the allocation failure. So what exactly is the
problem with them?

> What I think more important is hearing from __GFP_REPEAT users how hard they
> want to retry. It is possible that they want to retry unless SIGKILL is
> delivered, but passing __GFP_NOFAIL is too hard, and therefore __GFP_REPEAT
> is used instead. It is possible that they use __GFP_NOFAIL || __GFP_KILLABLE
> if __GFP_KILLABLE were available. In my module (though I'm not using
> __GFP_REPEAT), I want to retry unless SIGKILL is delivered.

To be honest killability for a particular allocation request sounds
like a hack to me. Just consider the expected semantic. How do you
handle when one path uses explicit __GFP_KILLABLE while other path (from
the same syscall) is not... If anything this would have to be process
context wise.
 
[...]
> >  	/* Reclaim has failed us, start killing things */
> >  	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
> >  	if (page)
> > @@ -3719,6 +3731,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >  	/* Retry as long as the OOM killer is making progress */
> >  	if (did_some_progress) {
> >  		no_progress_loops = 0;
> > +		passed_oom = true;
> 
> This is too premature. did_some_progress != 0 after returning from
> __alloc_pages_may_oom() does not mean the OOM killer was invoked. It only means
> that mutex_trylock(&oom_lock) was attempted.

which means that we have reached the OOM condition and _somebody_ is
actually handling the OOM on our behalf.

> It is possible that somebody else
> is on the way to call out_of_memory(). It is possible that the OOM reaper is
> about to start reaping memory. Giving up after 1 jiffie of sleep is too fast.

Sure, this will always be racy. But the primary point is that we have
passed the OOM line and then passed through all the retries to get to
the same state again. This sounds like a pretty natural boundary at which
to say we have tried hard enough and should rather fail and let the
caller handle the fallback.

-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH 1/2] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_HARD with more useful semantic
  2016-06-07 12:31     ` Michal Hocko
@ 2016-06-11 14:35       ` Tetsuo Handa
  2016-06-13 11:37         ` Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread
From: Tetsuo Handa @ 2016-06-11 14:35 UTC
  To: mhocko; +Cc: linux-mm, akpm, mgorman, vbabka, hannes, riel, david, linux-kernel

Michal Hocko wrote:
> On Tue 07-06-16 21:11:03, Tetsuo Handa wrote:
> > Remaining __GFP_REPEAT users are not always doing costly allocations.
> 
> Yes but...
> 
> > Sometimes they pass __GFP_REPEAT because the size is given from userspace.
> > Thus, unconditional s/__GFP_REPEAT/__GFP_RETRY_HARD/g is not good.
> 
> Would that be a regression though? Strictly speaking the __GFP_REPEAT
> documentation was explicit to not loop for ever. So nobody should have
> expected nofail semantic pretty much by definition. The fact that our
> previous implementation was not fully conforming to the documentation is
> just an implementation detail.  All the remaining users of __GFP_REPEAT
> _have_ to be prepared for the allocation failure. So what exactly is the
> problem with them?

A !costly allocation becomes weaker than now if __GFP_RETRY_HARD is passed.

> > >  	/* Reclaim has failed us, start killing things */
> > >  	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
> > >  	if (page)
> > > @@ -3719,6 +3731,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > >  	/* Retry as long as the OOM killer is making progress */
> > >  	if (did_some_progress) {
> > >  		no_progress_loops = 0;
> > > +		passed_oom = true;
> > 
> > This is too premature. did_some_progress != 0 after returning from
> > __alloc_pages_may_oom() does not mean the OOM killer was invoked. It only means
> > that mutex_trylock(&oom_lock) was attempted.
> 
> which means that we have reached the OOM condition and _somebody_ is
> actaully handling the OOM on our behalf.

That _somebody_ might release oom_lock without invoking the OOM killer (e.g.
doing !__GFP_FS allocation), which means that we have reached the OOM condition
and nobody is actually handling the OOM on our behalf. __GFP_RETRY_HARD becomes
as weak as __GFP_NORETRY. I think this is a regression.



> > What I think more important is hearing from __GFP_REPEAT users how hard they
> > want to retry. It is possible that they want to retry unless SIGKILL is
> > delivered, but passing __GFP_NOFAIL is too hard, and therefore __GFP_REPEAT
> > is used instead. It is possible that they use __GFP_NOFAIL || __GFP_KILLABLE
> > if __GFP_KILLABLE were available. In my module (though I'm not using
> > __GFP_REPEAT), I want to retry unless SIGKILL is delivered.
> 
> To be honest killability for a particular allocation request sounds
> like a hack to me. Just consider the expected semantic. How do you
> handle when one path uses explicit __GFP_KILLABLE while other path (from
> the same syscall) is not... If anything this would have to be process
> context wise.

I didn't catch your question. But making code killable should be considered
good unless it complicates error handling paths.

Since we are not setting TIF_MEMDIE to all OOM-killed threads, OOM-killed
threads will have to loop until mutex_trylock(&oom_lock) succeeds in order
to get TIF_MEMDIE by calling out_of_memory(), which is a needless delay.

Many allocations from syscall context can give up upon SIGKILL. We don't
need to allow OOM-killed threads to use memory reserves if that allocation
is killable.

Converting down_write(&mm->mmap_sem) to down_write_killable(&mm->mmap_sem)
is considered good. But converting kmalloc(GFP_KERNEL) to
kmalloc(GFP_KERNEL | __GFP_KILLABLE) is considered hack. Why?


* Re: [RFC PATCH 1/2] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_HARD with more useful semantic
  2016-06-11 14:35       ` Tetsuo Handa
@ 2016-06-13 11:37         ` Michal Hocko
  2016-06-13 14:54           ` Tetsuo Handa
  0 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2016-06-13 11:37 UTC
  To: Tetsuo Handa
  Cc: linux-mm, akpm, mgorman, vbabka, hannes, riel, david, linux-kernel

On Sat 11-06-16 23:35:49, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Tue 07-06-16 21:11:03, Tetsuo Handa wrote:
> > > Remaining __GFP_REPEAT users are not always doing costly allocations.
> > 
> > Yes but...
> > 
> > > Sometimes they pass __GFP_REPEAT because the size is given from userspace.
> > > Thus, unconditional s/__GFP_REPEAT/__GFP_RETRY_HARD/g is not good.
> > 
> > Would that be a regression though? Strictly speaking the __GFP_REPEAT
> > documentation was explicit to not loop for ever. So nobody should have
> > expected nofail semantic pretty much by definition. The fact that our
> > previous implementation was not fully conforming to the documentation is
> > just an implementation detail.  All the remaining users of __GFP_REPEAT
> > _have_ to be prepared for the allocation failure. So what exactly is the
> > problem with them?
> 
> A !costly allocation becomes weaker than now if __GFP_RETRY_HARD is passed.

That is true. But it is not weaker than the __GFP_REPEAT actually ever
promised. __GFP_REPEAT explicitly said to not retry _for_ever_. The
fact that we have ignored it is sad but that is what I am trying to
address here.

> > > >  	/* Reclaim has failed us, start killing things */
> > > >  	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
> > > >  	if (page)
> > > > @@ -3719,6 +3731,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > > >  	/* Retry as long as the OOM killer is making progress */
> > > >  	if (did_some_progress) {
> > > >  		no_progress_loops = 0;
> > > > +		passed_oom = true;
> > > 
> > > This is too premature. did_some_progress != 0 after returning from
> > > __alloc_pages_may_oom() does not mean the OOM killer was invoked. It only means
> > > that mutex_trylock(&oom_lock) was attempted.
> > 
> > which means that we have reached the OOM condition and _somebody_ is
> > actually handling the OOM on our behalf.
> 
> That _somebody_ might release oom_lock without invoking the OOM killer (e.g.
> doing !__GFP_FS allocation), which means that we have reached the OOM condition
> and nobody is actually handling the OOM on our behalf. __GFP_RETRY_HARD becomes
> as weak as __GFP_NORETRY. I think this is a regression.

I really fail to see your point. We are talking about a gfp flag which
tells the allocator to retry as much as it is feasible. Getting through
all the reclaim attempts two times without any progress sounds like a
fair criterion. Well, we could try $NUM times but that wouldn't make too
much difference to what you are writing above. The fact whether somebody
has been killed or not is not really that important IMHO.

> > > What I think more important is hearing from __GFP_REPEAT users how hard they
> > > want to retry. It is possible that they want to retry unless SIGKILL is
> > > delivered, but passing __GFP_NOFAIL is too hard, and therefore __GFP_REPEAT
> > > is used instead. It is possible that they use __GFP_NOFAIL || __GFP_KILLABLE
> > > if __GFP_KILLABLE were available. In my module (though I'm not using
> > > __GFP_REPEAT), I want to retry unless SIGKILL is delivered.
> > 
> > To be honest killability for a particular allocation request sounds
> > like a hack to me. Just consider the expected semantic. How do you
> > handle when one path uses explicit __GFP_KILLABLE while other path (from
> > the same syscall) is not... If anything this would have to be process
> > context wise.
> 
> I didn't catch your question. But making code killable should be considered
> good unless it complicates error handling paths.

What I meant was this:
	kmalloc(GFP_KILLABLE)
	func1()
	  kmalloc(GFP_KERNEL)

is still not a killable context because whatever you call (func1) might
have a different view about killability. So the per-allocation context
will not work reliably.

> Since we are not setting TIF_MEMDIE to all OOM-killed threads, OOM-killed
> threads will have to loop until mutex_trylock(&oom_lock) succeeds in order
> to get TIF_MEMDIE by calling out_of_memory(), which is a needless delay.
> 
> Many allocations from syscall context can give up upon SIGKILL. We don't
> need to allow OOM-killed threads to use memory reserves if that allocation
> is killable.
> 
> Converting down_write(&mm->mmap_sem) to down_write_killable(&mm->mmap_sem)
> is considered good. But converting kmalloc(GFP_KERNEL) to
> kmalloc(GFP_KERNEL | __GFP_KILLABLE) is considered hack. Why?

Because unblocking the killable context is meant to help others who want
to take the lock to make forward progress, while the killable allocation
context is only about failing the particular allocation when the task is
killed. There is no direct resource to release. So unless all the
allocations in the same scope are killable this will not help anything.
See my point?

-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH 1/2] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_HARD with more useful semantic
  2016-06-13 11:37         ` Michal Hocko
@ 2016-06-13 14:54           ` Tetsuo Handa
  2016-06-13 15:17             ` Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread
From: Tetsuo Handa @ 2016-06-13 14:54 UTC
  To: mhocko; +Cc: linux-mm, akpm, mgorman, vbabka, hannes, riel, david, linux-kernel

Michal Hocko wrote:
> On Sat 11-06-16 23:35:49, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Tue 07-06-16 21:11:03, Tetsuo Handa wrote:
> > > > Remaining __GFP_REPEAT users are not always doing costly allocations.
> > > 
> > > Yes but...
> > > 
> > > > Sometimes they pass __GFP_REPEAT because the size is given from userspace.
> > > > Thus, unconditional s/__GFP_REPEAT/__GFP_RETRY_HARD/g is not good.
> > > 
> > > Would that be a regression though? Strictly speaking the __GFP_REPEAT
> > > documentation was explicit to not loop for ever. So nobody should have
> > > expected nofail semantic pretty much by definition. The fact that our
> > > previous implementation was not fully conforming to the documentation is
> > > just an implementation detail.  All the remaining users of __GFP_REPEAT
> > > _have_ to be prepared for the allocation failure. So what exactly is the
> > > problem with them?
> > 
> > A !costly allocation becomes weaker than now if __GFP_RETRY_HARD is passed.
> 
> That is true. But it is not weaker than the __GFP_REPEAT actually ever
> promised. __GFP_REPEAT explicitly said to not retry _for_ever_. The
> fact that we have ignored it is sad but that is what I am trying to
> address here.

Whatever you rename __GFP_REPEAT to, it sounds strange to me that !costly
__GFP_REPEAT allocations are weaker than !costly !__GFP_REPEAT allocations.
Are you planning to make !costly !__GFP_REPEAT allocations behave like
__GFP_NORETRY?

> 
> > > > >  	/* Reclaim has failed us, start killing things */
> > > > >  	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
> > > > >  	if (page)
> > > > > @@ -3719,6 +3731,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > > > >  	/* Retry as long as the OOM killer is making progress */
> > > > >  	if (did_some_progress) {
> > > > >  		no_progress_loops = 0;
> > > > > +		passed_oom = true;
> > > > 
> > > > This is too premature. did_some_progress != 0 after returning from
> > > > __alloc_pages_may_oom() does not mean the OOM killer was invoked. It only means
> > > > that mutex_trylock(&oom_lock) was attempted.
> > > 
> > > which means that we have reached the OOM condition and _somebody_ is
> > > actually handling the OOM on our behalf.
> > 
> > That _somebody_ might release oom_lock without invoking the OOM killer (e.g.
> > doing !__GFP_FS allocation), which means that we have reached the OOM condition
> > and nobody is actually handling the OOM on our behalf. __GFP_RETRY_HARD becomes
> > as weak as __GFP_NORETRY. I think this is a regression.
> 
> I really fail to see your point. We are talking about a gfp flag which
> tells the allocator to retry as much as it is feasible. Getting through
> all the reclaim attempts two times without any progress sounds like a
> fair criterion. Well, we could try $NUM times but that wouldn't make too
> much difference to what you are writing above. The fact whether somebody
> has been killed or not is not really that important IMHO.

If all the reclaim attempts the first time made no progress, all the reclaim
attempts the second time are unlikely to make progress unless the OOM killer
kills something. Thus, doing all the reclaim attempts two times without any
progress and without killing somebody sounds almost equivalent to doing all
the reclaim attempts only once.


* Re: [RFC PATCH 1/2] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_HARD with more useful semantic
  2016-06-13 14:54           ` Tetsuo Handa
@ 2016-06-13 15:17             ` Michal Hocko
  2016-06-14 11:12               ` Tetsuo Handa
  0 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2016-06-13 15:17 UTC
  To: Tetsuo Handa
  Cc: linux-mm, akpm, mgorman, vbabka, hannes, riel, david, linux-kernel

On Mon 13-06-16 23:54:13, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Sat 11-06-16 23:35:49, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > On Tue 07-06-16 21:11:03, Tetsuo Handa wrote:
> > > > > Remaining __GFP_REPEAT users are not always doing costly allocations.
> > > > 
> > > > Yes but...
> > > > 
> > > > > Sometimes they pass __GFP_REPEAT because the size is given from userspace.
> > > > > Thus, unconditional s/__GFP_REPEAT/__GFP_RETRY_HARD/g is not good.
> > > > 
> > > > Would that be a regression though? Strictly speaking the __GFP_REPEAT
> > > > documentation was explicit to not loop for ever. So nobody should have
> > > > expected nofail semantic pretty much by definition. The fact that our
> > > > previous implementation was not fully conforming to the documentation is
> > > > just an implementation detail.  All the remaining users of __GFP_REPEAT
> > > > _have_ to be prepared for the allocation failure. So what exactly is the
> > > > problem with them?
> > > 
> > > A !costly allocation becomes weaker than now if __GFP_RETRY_HARD is passed.
> > 
> > That is true. But it is not weaker than the __GFP_REPEAT actually ever
> > promised. __GFP_REPEAT explicitly said to not retry _for_ever_. The
> > fact that we have ignored it is sad but that is what I am trying to
> > address here.
> 
> Whatever you rename __GFP_REPEAT to, it sounds strange to me that !costly
> __GFP_REPEAT allocations are weaker than !costly !__GFP_REPEAT allocations.
> Are you planning to make !costly !__GFP_REPEAT allocations behave like
> __GFP_NORETRY?

The patch description tries to explain the difference:
__GFP_NORETRY doesn't retry at all
__GFP_RETRY_HARD retries as hard as feasible
__GFP_NOFAIL retries for ever

all of them regardless of the order. This is the way to tell the
allocator to change its default behavior, which might be, and actually
is, different depending on the order.

[...]
> > > That _somebody_ might release oom_lock without invoking the OOM killer (e.g.
> > > doing !__GFP_FS allocation), which means that we have reached the OOM condition
> > > and nobody is actually handling the OOM on our behalf. __GFP_RETRY_HARD becomes
> > > as weak as __GFP_NORETRY. I think this is a regression.
> > 
> > I really fail to see your point. We are talking about a gfp flag which
> > tells the allocator to retry as much as it is feasible. Getting through
> > all the reclaim attempts two times without any progress sounds like a
> > fair criterion. Well, we could try $NUM times but that wouldn't make too
> > much difference to what you are writing above. The fact whether somebody
> > has been killed or not is not really that important IMHO.
> 
> If all the reclaim attempts the first time made no progress, all the reclaim
> attempts the second time are unlikely to make progress unless the OOM killer
> kills something. Thus, doing all the reclaim attempts two times without any
> progress and without killing somebody sounds almost equivalent to doing all
> the reclaim attempts only once.

Yes, that is possible. You might have a GFP_NOFS-only load where nothing
really invokes the OOM killer. Does that actually matter, though? The
semantic of the flag is to retry hard while the page allocator believes
it can make a forward progress. But not for ever. We never know whether
a progress is possible at all. We have certain heuristics when to give
up, try to invoke the OOM killer and try again hoping things have changed.
This is not much different, except that we declare there is no hope once
we get to the OOM point again without being able to succeed. Are you
suggesting a more precise heuristic? Or do you claim that we do not need
a flag which would provide a middle ground between __GFP_NORETRY and
__GFP_NOFAIL, which are on the extreme sides?

-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH 1/2] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_HARD with more useful semantic
  2016-06-13 15:17             ` Michal Hocko
@ 2016-06-14 11:12               ` Tetsuo Handa
  2016-06-14 18:54                 ` Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread
From: Tetsuo Handa @ 2016-06-14 11:12 UTC
  To: mhocko; +Cc: linux-mm, akpm, mgorman, vbabka, hannes, riel, david, linux-kernel

Michal Hocko wrote:
> > > > That _somebody_ might release oom_lock without invoking the OOM killer (e.g.
> > > > doing !__GFP_FS allocation), which means that we have reached the OOM condition
> > > > and nobody is actually handling the OOM on our behalf. __GFP_RETRY_HARD becomes
> > > > as weak as __GFP_NORETRY. I think this is a regression.
> > > 
> > > I really fail to see your point. We are talking about a gfp flag which
> > > tells the allocator to retry as much as it is feasible. Getting through
> > > all the reclaim attempts two times without any progress sounds like a
> > > fair criterion. Well, we could try $NUM times but that wouldn't make too
> > > much difference to what you are writing above. The fact whether somebody
> > > has been killed or not is not really that important IMHO.
> > 
> > If all the reclaim attempt first time made no progress, all the reclaim
> > attempt second time unlikely make progress unless the OOM killer kills
> > something. Thus, doing all the reclaim attempts two times without any progress
> > without killing somebody sounds almost equivalent to doing all the reclaim
> > attempt only once.
> 
> Yes, that is possible. You might have a GFP_NOFS-only load where nothing
> really invokes the OOM killer. Does that actually matter, though? The
> semantic of the flag is to retry hard while the page allocator believes
> it can make a forward progress. But not for ever. We never know whether
> a progress is possible at all. We have certain heuristics when to give
> up, try to invoke the OOM killer and try again hoping things have changed.
> This is not much different, except that we declare there is no hope once
> we get to the OOM point again without being able to succeed. Are you
> suggesting a more precise heuristic? Or do you claim that we do not need
> a flag which would provide a middle ground between __GFP_NORETRY and
> __GFP_NOFAIL, which are on the extreme sides?

Well, maybe we can get rid of __GFP_RETRY (or make __GFP_RETRY used only
for huge pages). Many __GFP_RETRY users are ready to fall back to
vmalloc().

We are not sure whether such __GFP_RETRY users want to retry with OOM-killing
somebody (we don't have __GFP_MAY_OOM_KILL which explicitly asks for "retry
with OOM-killing somebody").

If __GFP_RETRY means nothing but try once more,

	void *n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
	if (!n)
		n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);

will emulate it.



----- arch/powerpc/include/asm/book3s/64/pgalloc.h -----

static inline pgd_t *radix__pgd_alloc(struct mm_struct *mm)
{
#ifdef CONFIG_PPC_64K_PAGES
        return (pgd_t *)__get_free_page(PGALLOC_GFP);
#else
        struct page *page;
        page = alloc_pages(GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO | __GFP_REPEAT, 4);
        if (!page)
                return NULL;
        return (pgd_t *) page_address(page);
#endif
}

----- arch/powerpc/kvm/book3s_64_mmu_hv.c -----

        kvm->arch.hpt_cma_alloc = 0;
        page = kvm_alloc_hpt(1ul << (order - PAGE_SHIFT));
        if (page) {
                hpt = (unsigned long)pfn_to_kaddr(page_to_pfn(page));
                memset((void *)hpt, 0, (1ul << order));
                kvm->arch.hpt_cma_alloc = 1;
        }

        /* Lastly try successively smaller sizes from the page allocator */
        /* Only do this if userspace didn't specify a size via ioctl */
        while (!hpt && order > 18 && !htab_orderp) {
                hpt = __get_free_pages(GFP_KERNEL|__GFP_ZERO|__GFP_REPEAT|
                                       __GFP_NOWARN, order - PAGE_SHIFT);
                if (!hpt)
                        --order;
        }

        if (!hpt)
                return -ENOMEM;

----- drivers/vhost/vhost.c -----

static void *vhost_kvzalloc(unsigned long size)
{
        void *n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);

        if (!n)
                n = vzalloc(size);
        return n;
}

----- drivers/vhost/scsi.c -----

        vs = kzalloc(sizeof(*vs), GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
        if (!vs) {
                vs = vzalloc(sizeof(*vs));
                if (!vs)
                        goto err_vs;
        }

----- drivers/vhost/net.c -----

        n = kmalloc(sizeof *n, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
        if (!n) {
                n = vmalloc(sizeof *n);
                if (!n)
                        return -ENOMEM;
        }

----- drivers/block/xen-blkfront.c -----

                /* Stage 1: Make a safe copy of the shadow state. */
                copy = kmemdup(rinfo->shadow, sizeof(rinfo->shadow),
                               GFP_NOIO | __GFP_REPEAT | __GFP_HIGH);
                if (!copy)
                        return -ENOMEM;

----- drivers/mmc/host/wbsd.c -----

        /*
         * We need to allocate a special buffer in
         * order for ISA to be able to DMA to it.
         */
        host->dma_buffer = kmalloc(65536,
                GFP_NOIO | GFP_DMA | __GFP_REPEAT | __GFP_NOWARN);
        if (!host->dma_buffer)
                goto free;

----- drivers/target/target_core_transport.c -----

        se_sess->sess_cmd_map = kzalloc(tag_num * tag_size,
                                        GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
        if (!se_sess->sess_cmd_map) {
                se_sess->sess_cmd_map = vzalloc(tag_num * tag_size);
                if (!se_sess->sess_cmd_map) {
                        pr_err("Unable to allocate se_sess->sess_cmd_map\n");
                        return -ENOMEM;
                }
        }

----- drivers/s390/char/vmcp.c -----

        if (mutex_lock_interruptible(&session->mutex)) {
                kfree(cmd);
                return -ERESTARTSYS;
        }
        if (!session->response)
                session->response = (char *)__get_free_pages(GFP_KERNEL
                                                | __GFP_REPEAT | GFP_DMA,
                                                get_order(session->bufsize));
        if (!session->response) {
                mutex_unlock(&session->mutex);
                kfree(cmd);
                return -ENOMEM;
        }

----- fs/btrfs/raid56.c -----

        table_size = sizeof(*table) + sizeof(*h) * num_entries;
        table = kzalloc(table_size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
        if (!table) {
                table = vzalloc(table_size);
                if (!table)
                        return -ENOMEM;
        }

----- fs/btrfs/check-integrity.c -----

        state = kzalloc(sizeof(*state), GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
        if (!state) {
                state = vzalloc(sizeof(*state));
                if (!state) {
                        printk(KERN_INFO "btrfs check-integrity: vzalloc() failed!\n");
                        return -1;
                }
        }

----- mm/sparse-vmemmap.c -----

void * __meminit vmemmap_alloc_block(unsigned long size, int node)
{
        /* If the main allocator is up use that, fallback to bootmem. */
        if (slab_is_available()) {
                struct page *page;

                if (node_state(node, N_HIGH_MEMORY))
                        page = alloc_pages_node(
                                node, GFP_KERNEL | __GFP_ZERO | __GFP_REPEAT,
                                get_order(size));
                else
                        page = alloc_pages(
                                GFP_KERNEL | __GFP_ZERO | __GFP_REPEAT,
                                get_order(size));
                if (page)
                        return page_address(page);
                return NULL;
        } else
                return __earlyonly_bootmem_alloc(node, size, size,
                                __pa(MAX_DMA_ADDRESS));
}

----- mm/hugetlb.c -----

static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
{
        struct page *page;

        page = __alloc_pages_node(nid,
                htlb_alloc_mask(h)|__GFP_COMP|__GFP_THISNODE|
                                                __GFP_REPEAT|__GFP_NOWARN,
                huge_page_order(h));
        if (page) {
                prep_new_huge_page(h, page, nid);
        }

        return page;
}

static struct page *__hugetlb_alloc_buddy_huge_page(struct hstate *h,
                struct vm_area_struct *vma, unsigned long addr, int nid)
{
        int order = huge_page_order(h);
        gfp_t gfp = htlb_alloc_mask(h)|__GFP_COMP|__GFP_REPEAT|__GFP_NOWARN;
        unsigned int cpuset_mems_cookie;

----- net/core/skbuff.c -----

        gfp_head = gfp_mask;
        if (gfp_head & __GFP_DIRECT_RECLAIM)
                gfp_head |= __GFP_REPEAT;

        *errcode = -ENOBUFS;
        skb = alloc_skb(header_len, gfp_head);
        if (!skb)
                return NULL;

----- net/core/dev.c -----

        rx = kzalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
        if (!rx) {
                rx = vzalloc(sz);
                if (!rx)
                        return -ENOMEM;
        }
        dev->_rx = rx;

        tx = kzalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
        if (!tx) {
                tx = vzalloc(sz);
                if (!tx)
                        return -ENOMEM;
        }
        dev->_tx = tx;

        p = kzalloc(alloc_size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
        if (!p)
                p = vzalloc(alloc_size);
        if (!p)
                return NULL;

----- net/sched/sch_fq.c -----

static void *fq_alloc_node(size_t sz, int node)
{
        void *ptr;

        ptr = kmalloc_node(sz, GFP_KERNEL | __GFP_REPEAT | __GFP_NOWARN, node);
        if (!ptr)
                ptr = vmalloc_node(sz, node);
        return ptr;
}
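
Nearly all of the call sites above share one idiom: try the physically
contiguous allocation first, fall back to vmalloc. A sketch of a generic
helper capturing that pattern (the kvzalloc_repeat() name is
hypothetical, not an existing kernel API):

	#include <linux/slab.h>
	#include <linux/vmalloc.h>

	/*
	 * Prefer a physically contiguous buffer but accept a vmalloc'ed
	 * one rather than failing outright. The result must be freed
	 * with kvfree().
	 */
	static void *kvzalloc_repeat(size_t size)
	{
		void *p = kzalloc(size, GFP_KERNEL | __GFP_NOWARN |
				  __GFP_REPEAT);

		if (!p)
			p = vzalloc(size);
		return p;
	}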

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 1/2] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_HARD with more useful semantic
  2016-06-14 11:12               ` Tetsuo Handa
@ 2016-06-14 18:54                 ` Michal Hocko
  0 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2016-06-14 18:54 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: linux-mm, akpm, mgorman, vbabka, hannes, riel, david, linux-kernel

On Tue 14-06-16 20:12:08, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > > > > That _somebody_ might release oom_lock without invoking the OOM killer (e.g.
> > > > > doing !__GFP_FS allocation), which means that we have reached the OOM condition
> > > > > and nobody is actually handling the OOM on our behalf. __GFP_RETRY_HARD becomes
> > > > > as weak as __GFP_NORETRY. I think this is a regression.
> > > > 
> > > > I really fail to see your point. We are talking about a gfp flag
> > > > which tells the allocator to retry as much as is feasible. Getting
> > > > through all the reclaim attempts two times without any progress
> > > > sounds like a fair criterion. Well, we could try $NUM times but
> > > > that wouldn't make much difference to what you are writing above.
> > > > Whether somebody has been killed or not is not really that
> > > > important IMHO.
> > > 
> > > If all the reclaim attempts made no progress the first time, they
> > > are unlikely to make progress the second time unless the OOM killer
> > > kills something. Thus, doing all the reclaim attempts two times
> > > without any progress and without killing somebody sounds almost
> > > equivalent to doing all the reclaim attempts only once.
> > 
> > Yes, that is possible. You might have a GFP_NOFS-only load where
> > nothing really invokes the OOM killer. Does that actually matter,
> > though? The semantic of the flag is to retry hard while the page
> > allocator believes it can make forward progress, but not for ever. We
> > never know whether progress is possible at all. We have certain
> > heuristics for when to give up, try to invoke the OOM killer and try
> > again hoping things have changed. This is not much different, except
> > we declare that there is no hope once we would get to the OOM point
> > again without being able to succeed. Are you suggesting a more precise
> > heuristic? Or do you claim that we do not need a flag which would put
> > a middle ground between __GFP_NORETRY and __GFP_NOFAIL, which are at
> > the extreme ends?
> 
> Well, maybe we can get rid of __GFP_RETRY (or make __GFP_RETRY used only
> for huge pages). Many __GFP_RETRY users are ready to fall back to
> vmalloc().

But some of them should try hard before they fall back to vmalloc. And
let me repeat, there are valid usecases where you want to tell the
allocator not to retry !costly requests for ever.

> We are not sure whether such __GFP_RETRY users want to retry with OOM-killing
> somebody (we don't have __GFP_MAY_OOM_KILL which explicitly asks for "retry
> with OOM-killing somebody").

And we do not want something like __GFP_MAY_OOM_KILL. We have
__GFP_NORETRY to tell the allocator to bail out early. We do not want
callers to control the OOM behavior. That is an MM-internal thing IMHO.

> If __GFP_RETRY means nothing but try once more,
> 
> 	void *n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
> 	if (!n)
> 		n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
> 
> will emulate it.

It won't. __GFP_NORETRY is way too weak because it only invokes
optimistic compaction, so the success rate would be really small even if
you retry with the same flag multiple times in a row. We definitely need
a stronger mode which tells the allocator to really try hard before it
fails.
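
To restate the difference in code (a sketch, assuming a !costly
request):

	/* The emulation above: two quick attempts. Each one gives up
	 * after a single optimistic reclaim/compaction pass, so the
	 * combined success rate stays low. */
	void *n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
	if (!n)
		n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);

	/* The proposed mode: one call which keeps retrying while the
	 * allocator believes forward progress is possible and fails only
	 * when it would otherwise head for the OOM killer again. */
	void *m = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_HARD);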
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] xfs: map KM_MAYFAIL to __GFP_RETRY_HARD
  2016-06-06 11:32 ` [RFC PATCH 2/2] xfs: map KM_MAYFAIL to __GFP_RETRY_HARD Michal Hocko
@ 2016-06-16  0:23   ` Dave Chinner
  2016-06-16  8:03     ` Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread
From: Dave Chinner @ 2016-06-16  0:23 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Mel Gorman, Vlastimil Babka,
	Johannes Weiner, Rik van Riel, LKML, Michal Hocko

On Mon, Jun 06, 2016 at 01:32:16PM +0200, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> KM_MAYFAIL didn't have any suitable GFP_FOO counterpart until recently
> so it relied on the default page allocator behavior for the given set
> of flags. This means that small allocations actually never failed.
> 
> Now that we have the __GFP_RETRY_HARD flag, which works independently of
> the allocation request size, we can map KM_MAYFAIL to it. The allocator
> will try as hard as it can to fulfill the request but will eventually
> fail if no progress can be made.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  fs/xfs/kmem.h | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
> index 689f746224e7..34e6b062ce0e 100644
> --- a/fs/xfs/kmem.h
> +++ b/fs/xfs/kmem.h
> @@ -54,6 +54,9 @@ kmem_flags_convert(xfs_km_flags_t flags)
>  			lflags &= ~__GFP_FS;
>  	}
>  
> +	if (flags & KM_MAYFAIL)
> +		lflags |= __GFP_RETRY_HARD;
> +

I don't understand. KM_MAYFAIL means "caller handles
allocation failure, so retry on failure is not required." To then
map KM_MAYFAIL to a flag that implies the allocation will internally
retry exceptionally hard to prevent failure seems wrong.

IOWs, KM_MAYFAIL means XFS is just asking for normal allocator
behaviour here, so I'm not sure what problem this change is actually
solving; it's not clear from the description....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] xfs: map KM_MAYFAIL to __GFP_RETRY_HARD
  2016-06-16  0:23   ` Dave Chinner
@ 2016-06-16  8:03     ` Michal Hocko
  2016-06-16 11:26       ` Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2016-06-16  8:03 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-mm, Andrew Morton, Mel Gorman, Vlastimil Babka,
	Johannes Weiner, Rik van Riel, LKML

On Thu 16-06-16 10:23:02, Dave Chinner wrote:
> On Mon, Jun 06, 2016 at 01:32:16PM +0200, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > KM_MAYFAIL didn't have any suitable GFP_FOO counterpart until recently
> > so it relied on the default page allocator behavior for the given set
> > of flags. This means that small allocations actually never failed.
> > 
> > Now that we have the __GFP_RETRY_HARD flag, which works independently
> > of the allocation request size, we can map KM_MAYFAIL to it. The
> > allocator will try as hard as it can to fulfill the request but will
> > eventually fail if no progress can be made.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > ---
> >  fs/xfs/kmem.h | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
> > index 689f746224e7..34e6b062ce0e 100644
> > --- a/fs/xfs/kmem.h
> > +++ b/fs/xfs/kmem.h
> > @@ -54,6 +54,9 @@ kmem_flags_convert(xfs_km_flags_t flags)
> >  			lflags &= ~__GFP_FS;
> >  	}
> >  
> > +	if (flags & KM_MAYFAIL)
> > +		lflags |= __GFP_RETRY_HARD;
> > +
> 
> I don't understand. KM_MAYFAIL means "caller handles
> allocation failure, so retry on failure is not required." To then
> map KM_MAYFAIL to a flag that implies the allocation will internally
> retry exceptionally hard to prevent failure seems wrong.

The primary point, which I've tried to describe in the changelog, is
that the default allocator behavior is to retry endlessly for small
orders. You can override this by using __GFP_NORETRY, which doesn't retry
at all and fails quite early. My understanding of KM_MAYFAIL is that the
caller can cope with allocation failures. The lack of __GFP_NORETRY made
me think that the failure should be prevented as much as possible.
__GFP_RETRY_HARD is semantically somewhere in the middle between
__GFP_NORETRY and __GFP_NOFAIL, independently of the allocation size.

Does that make more sense now?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] xfs: map KM_MAYFAIL to __GFP_RETRY_HARD
  2016-06-16  8:03     ` Michal Hocko
@ 2016-06-16 11:26       ` Michal Hocko
  2016-06-17 18:22         ` Johannes Weiner
  0 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2016-06-16 11:26 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-mm, Andrew Morton, Mel Gorman, Vlastimil Babka,
	Johannes Weiner, Rik van Riel, LKML

On Thu 16-06-16 10:03:55, Michal Hocko wrote:
> On Thu 16-06-16 10:23:02, Dave Chinner wrote:
> > On Mon, Jun 06, 2016 at 01:32:16PM +0200, Michal Hocko wrote:
> > > From: Michal Hocko <mhocko@suse.com>
> > > 
> > > KM_MAYFAIL didn't have any suitable GFP_FOO counterpart until recently
> > > so it relied on the default page allocator behavior for the given set
> > > of flags. This means that small allocations actually never failed.
> > > 
> > > Now that we have the __GFP_RETRY_HARD flag, which works
> > > independently of the allocation request size, we can map KM_MAYFAIL
> > > to it. The allocator will try as hard as it can to fulfill the
> > > request but will eventually fail if no progress can be made.
> > > 
> > > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > > ---
> > >  fs/xfs/kmem.h | 3 +++
> > >  1 file changed, 3 insertions(+)
> > > 
> > > diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
> > > index 689f746224e7..34e6b062ce0e 100644
> > > --- a/fs/xfs/kmem.h
> > > +++ b/fs/xfs/kmem.h
> > > @@ -54,6 +54,9 @@ kmem_flags_convert(xfs_km_flags_t flags)
> > >  			lflags &= ~__GFP_FS;
> > >  	}
> > >  
> > > +	if (flags & KM_MAYFAIL)
> > > +		lflags |= __GFP_RETRY_HARD;
> > > +
> > 
> > I don't understand. KM_MAYFAIL means "caller handles
> > allocation failure, so retry on failure is not required." To then
> > map KM_MAYFAIL to a flag that implies the allocation will internally
> > retry exceptionally hard to prevent failure seems wrong.
> 
> The primary point, which I've tried to describe in the changelog, is
> that the default allocator behavior is to retry endlessly for small
> orders. You can override this by using __GFP_NORETRY, which doesn't
> retry at all and fails quite early. My understanding of KM_MAYFAIL is
> that the caller can cope with allocation failures. The lack of
> __GFP_NORETRY made me think that the failure should be prevented as much
> as possible. __GFP_RETRY_HARD is semantically somewhere in the middle
> between __GFP_NORETRY and __GFP_NOFAIL, independently of the allocation
> size.
> 
> Does that make more sense now?

I would add the following explanation into the code:
diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index 34e6b062ce0e..10708f065191 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -54,6 +54,13 @@ kmem_flags_convert(xfs_km_flags_t flags)
 			lflags &= ~__GFP_FS;
 	}
 
+	/*
+	 * Default page/slab allocator behavior is to retry for ever
+	 * for small allocations. We can override this behavior by using
+	 * __GFP_RETRY_HARD which will tell the allocator to retry as long
+	 * as it is feasible, but fail rather than retry for ever, for all
+	 * request sizes.
+	 */
 	if (flags & KM_MAYFAIL)
 		lflags |= __GFP_RETRY_HARD;
 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] xfs: map KM_MAYFAIL to __GFP_RETRY_HARD
  2016-06-16 11:26       ` Michal Hocko
@ 2016-06-17 18:22         ` Johannes Weiner
  2016-06-17 20:30           ` Vlastimil Babka
  0 siblings, 1 reply; 22+ messages in thread
From: Johannes Weiner @ 2016-06-17 18:22 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dave Chinner, linux-mm, Andrew Morton, Mel Gorman,
	Vlastimil Babka, Rik van Riel, LKML

On Thu, Jun 16, 2016 at 01:26:06PM +0200, Michal Hocko wrote:
> @@ -54,6 +54,13 @@ kmem_flags_convert(xfs_km_flags_t flags)
>  			lflags &= ~__GFP_FS;
>  	}
>  
> +	/*
> +	 * Default page/slab allocator behavior is to retry for ever
> +	 * for small allocations. We can override this behavior by using
> +	 * __GFP_RETRY_HARD which will tell the allocator to retry as long
> +	 * as it is feasible, but fail rather than retry for ever, for all
> +	 * request sizes.
> +	 */
>  	if (flags & KM_MAYFAIL)
>  		lflags |= __GFP_RETRY_HARD;

I think this example shows that __GFP_RETRY_HARD is not a good flag
because it conflates two seemingly unrelated semantics; the comment
doesn't quite make up for that.

When the flag is set,

- it allows costly orders to invoke the OOM killer and retry
- it allows !costly orders to fail

While 1. is obvious from the name, 2. is not. Even if we don't want
full-on fine-grained naming for every reclaim methodology and retry
behavior, those two things just shouldn't be tied together.

I don't see us failing !costly orders per default anytime soon, and
they are common, so adding a __GFP_MAYFAIL to explicitly override
that behavior seems like a good idea to me. That would make the XFS
callsite here perfectly obvious.

And you can still combine it with __GFP_REPEAT.

For a generic allocation site like this, __GFP_MAYFAIL | __GFP_REPEAT
does the right thing for all orders, and it's self-explanatory: try
hard, allow falling back.

Whether we want a __GFP_REPEAT or __GFP_TRY_HARD at all is a different
topic. In the long term, it might be better to provide best-effort per
default and simply annotate MAYFAIL/NORETRY callsites that want to
give up earlier. Because as I mentioned at LSFMM, it's much easier to
identify callsites that have a convenient fallback than callsites that
need to "try harder." Everybody thinks their allocations are oh so
important. The former is much more specific and uses obvious criteria.

Either way, __GFP_MAYFAIL should be on its own.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] xfs: map KM_MAYFAIL to __GFP_RETRY_HARD
  2016-06-17 18:22         ` Johannes Weiner
@ 2016-06-17 20:30           ` Vlastimil Babka
  2016-06-17 21:39             ` Johannes Weiner
  0 siblings, 1 reply; 22+ messages in thread
From: Vlastimil Babka @ 2016-06-17 20:30 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko
  Cc: Dave Chinner, linux-mm, Andrew Morton, Mel Gorman, Rik van Riel, LKML

On 17.6.2016 20:22, Johannes Weiner wrote:
> On Thu, Jun 16, 2016 at 01:26:06PM +0200, Michal Hocko wrote:
>> @@ -54,6 +54,13 @@ kmem_flags_convert(xfs_km_flags_t flags)
>>  			lflags &= ~__GFP_FS;
>>  	}
>>  
>> +	/*
>> +	 * Default page/slab allocator behavior is to retry for ever
>> +	 * for small allocations. We can override this behavior by using
>> +	 * __GFP_RETRY_HARD which will tell the allocator to retry as long
>> +	 * as it is feasible, but fail rather than retry for ever, for all
>> +	 * request sizes.
>> +	 */
>>  	if (flags & KM_MAYFAIL)
>>  		lflags |= __GFP_RETRY_HARD;
> 
> I think this example shows that __GFP_RETRY_HARD is not a good flag
> because it conflates two seemingly unrelated semantics; the comment
> doesn't quite make up for that.
> 
> When the flag is set,
> 
> - it allows costly orders to invoke the OOM killer and retry

No, it's not allowing the OOM killer for costly orders, only non-costly, AFAIK.
Mainly it allows more aggressive compaction (especially after my series [1]).

> - it allows !costly orders to fail
> 
> While 1. is obvious from the name, 2. is not. Even if we don't want
> full-on fine-grained naming for every reclaim methodology and retry
> behavior, those two things just shouldn't be tied together.

Well, if allocation is not allowed to fail, it's like trying "indefinitely hard"
already. Telling it it should "try hard" then doesn't make any sense without
also being able to fail.

> I don't see us failing !costly orders per default anytime soon, and
> they are common, so adding a __GFP_MAYFAIL to explicitly override
> that behavior seems like a good idea to me. That would make the XFS
> callsite here perfectly obvious.
> 
> And you can still combine it with __GFP_REPEAT.

But that would mean the following meaningful combinations for non-costly orders
(assuming e.g. GFP_KERNEL which allows reclaim/compaction in the first place).

__GFP_NORETRY - that one is well understood hopefully, and implicitly mayfail

__GFP_MAYFAIL - ???
__GFP_MAYFAIL | __GFP_REPEAT - ???

Which one of the last two tries harder? How specifically? Will they differ by
(not) allowing OOM? Won't that be just extra confusing?

> For a generic allocation site like this, __GFP_MAYFAIL | __GFP_REPEAT
> does the right thing for all orders, and it's self-explanatory: try
> hard, allow falling back.
> 
> Whether we want a __GFP_REPEAT or __GFP_TRY_HARD at all is a different
> topic. In the long term, it might be better to provide best-effort per
> default and simply annotate MAYFAIL/NORETRY callsites that want to
> give up earlier. Because as I mentioned at LSFMM, it's much easier to
> identify callsites that have a convenient fallback than callsites that
> need to "try harder." Everybody thinks their allocations are oh so
> important. The former is much more specific and uses obvious criteria.

For higher-order allocations, best-effort might also mean significant system
disruption, not just latency of the allocation itself. One example is hugetlbfs
allocations (echo X > .../nr_hugepages) where the admin is willing to pay this
cost. But to do that by default and rely on everyone else passing NORETRY
wouldn't go far. So I think the TRY_HARD kind of flag makes sense.

> Either way, __GFP_MAYFAIL should be on its own.

[1] https://lwn.net/Articles/689154/

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] xfs: map KM_MAYFAIL to __GFP_RETRY_HARD
  2016-06-17 20:30           ` Vlastimil Babka
@ 2016-06-17 21:39             ` Johannes Weiner
  2016-06-20  8:08               ` Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread
From: Johannes Weiner @ 2016-06-17 21:39 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Michal Hocko, Dave Chinner, linux-mm, Andrew Morton, Mel Gorman,
	Rik van Riel, LKML

On Fri, Jun 17, 2016 at 10:30:06PM +0200, Vlastimil Babka wrote:
> On 17.6.2016 20:22, Johannes Weiner wrote:
> > On Thu, Jun 16, 2016 at 01:26:06PM +0200, Michal Hocko wrote:
> >> @@ -54,6 +54,13 @@ kmem_flags_convert(xfs_km_flags_t flags)
> >>  			lflags &= ~__GFP_FS;
> >>  	}
> >>  
> >> +	/*
> >> +	 * Default page/slab allocator behavior is to retry for ever
> >> +	 * for small allocations. We can override this behavior by using
> >> +	 * __GFP_RETRY_HARD which will tell the allocator to retry as long
> >> +	 * as it is feasible, but fail rather than retry for ever, for all
> >> +	 * request sizes.
> >> +	 */
> >>  	if (flags & KM_MAYFAIL)
> >>  		lflags |= __GFP_RETRY_HARD;
> > 
> > I think this example shows that __GFP_RETRY_HARD is not a good flag
> > because it conflates two seemingly unrelated semantics; the comment
> > doesn't quite make up for that.
> > 
> > When the flag is set,
> > 
> > - it allows costly orders to invoke the OOM killer and retry
> 
> No, it's not allowing the OOM killer for costly orders, only non-costly, AFAIK.
> Mainly it allows more aggressive compaction (especially after my series [1]).

Ah, you're right. It calls into the may_oom function but that skips
actual killing for costly orders.

> > - it allows !costly orders to fail
> > 
> > While 1. is obvious from the name, 2. is not. Even if we don't want
> > full-on fine-grained naming for every reclaim methodology and retry
> > behavior, those two things just shouldn't be tied together.
> 
> Well, if allocation is not allowed to fail, it's like trying "indefinitely hard"
> already. Telling it it should "try hard" then doesn't make any sense without
> also being able to fail.

I can see that argument, but it's really anything but obvious at the
callsite. Dave's response to Michal's patch was a good demonstration.
And I don't think adding comments fixes an unintuitive interface.

> > I don't see us failing !costly orders per default anytime soon, and
> > they are common, so adding a __GFP_MAYFAIL to explicitly override
> > that behavior seems like a good idea to me. That would make the XFS
> > callsite here perfectly obvious.
> > 
> > And you can still combine it with __GFP_REPEAT.
> 
> But that would mean the following meaningful combinations for non-costly orders
> (assuming e.g. GFP_KERNEL which allows reclaim/compaction in the first place).

I would ignore order here. Part of what makes this interface
unintuitive is when we expect different flags to be passed for
different orders, especially because the orders are often
variable. Michal's __GFP_RETRY_HARD is an improvement in the sense
that it ignores the order and tries to do the right thing regardless
of it. The interface should really be about the intent at the
callsite, not about implementation details of the allocator.

But adding TRY_HARD to express "this can fail" isn't intuitive.

> __GFP_NORETRY - that one is well understood hopefully, and implicitly mayfail

Yeah. Never OOM, never retry etc. The callsite can fall back, and
prefers that over OOM kills and disruptive allocation latencies.

> __GFP_MAYFAIL - ???

May OOM for certain orders and retry a few times, but still fail. The
callsite can fall back, but it wouldn't come for free. E.g. it might
have to abort an explicitly requested user operation.

This is the default for costly orders, so it has an effect only on
non-costly orders. But that's where I would separate interface from
implementation: you'd use it e.g. in callsites where you have variable
orders but always the same fallback. XFS does that extensively.

> __GFP_MAYFAIL | __GFP_REPEAT - ???
> 
> Which one of the last two tries harder? How specifically? Will they differ by
> (not) allowing OOM? Won't that be just extra confusing?

Adding __GFP_REPEAT would always be additive. This combination would
mean: try the hardest not to fail, but don't lock up in cases when the
order happens to be !costly.

Again, I'm not too thrilled about that flag as it's so damn vague. But
that's more about how we communicate latency/success expectations. My
concern is exclusively about its implication of MAYFAIL.

> > For a generic allocation site like this, __GFP_MAYFAIL | __GFP_REPEAT
> > does the right thing for all orders, and it's self-explanatory: try
> > hard, allow falling back.
> > 
> > Whether we want a __GFP_REPEAT or __GFP_TRY_HARD at all is a different
> > topic. In the long term, it might be better to provide best-effort per
> > default and simply annotate MAYFAIL/NORETRY callsites that want to
> > give up earlier. Because as I mentioned at LSFMM, it's much easier to
> > identify callsites that have a convenient fallback than callsites that
> > need to "try harder." Everybody thinks their allocations are oh so
> > important. The former is much more specific and uses obvious criteria.
> 
> For higher-order allocations, best-effort might also mean significant system
> disruption, not just latency of the allocation itself. One example is hugetlbfs
> allocations (echo X > .../nr_hugepages) where the admin is willing to pay this
> cost. But to do that by default and rely on everyone else passing NORETRY
> wouldn't go far. So I think the TRY_HARD kind of flag makes sense.

I think whether the best-effort behavior should be opt-in or opt-out,
or how fine-grained the latency/success control over the allocator
should be is a different topic. I'd prefer defaulting to reliability
and annotating low-latency requirements, but I can see TRY_HARD work
too. It just shouldn't imply MAY_FAIL.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] xfs: map KM_MAYFAIL to __GFP_RETRY_HARD
  2016-06-17 21:39             ` Johannes Weiner
@ 2016-06-20  8:08               ` Michal Hocko
  2016-06-21  4:22                 ` Johannes Weiner
  0 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2016-06-20  8:08 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Vlastimil Babka, Dave Chinner, linux-mm, Andrew Morton,
	Mel Gorman, Rik van Riel, LKML

On Fri 17-06-16 17:39:31, Johannes Weiner wrote:
> On Fri, Jun 17, 2016 at 10:30:06PM +0200, Vlastimil Babka wrote:
> > On 17.6.2016 20:22, Johannes Weiner wrote:
[...]
> > > - it allows !costly orders to fail
> > > 
> > > While 1. is obvious from the name, 2. is not. Even if we don't want
> > > full-on fine-grained naming for every reclaim methodology and retry
> > > behavior, those two things just shouldn't be tied together.
> > 
> > Well, if allocation is not allowed to fail, it's like trying "indefinitely hard"
> > already. Telling it it should "try hard" then doesn't make any sense without
> > also being able to fail.
> 
> I can see that argument, but it's really anything but obvious at the
> callsite. Dave's response to Michal's patch was a good demonstration.
> And I don't think adding comments fixes an unintuitive interface.

Yeah, I am aware of that. And it is unfortunate but a side effect of our
!costly vs. costly difference in the default behavior. What I wanted
to achieve was to have overrides for the default behavior (whatever it
is). We already have two such flags and having something semantically in
the middle sounds like a consistent way to me.

> > > I don't see us failing !costly orders per default anytime soon, and
> > > they are common, so adding a __GFP_MAYFAIL to explicitly override
> > > that behavior seems like a good idea to me. That would make the XFS
> > > callsite here perfectly obvious.
> > > 
> > > And you can still combine it with __GFP_REPEAT.
> > 
> > But that would mean the following meaningful combinations for non-costly orders
> > (assuming e.g. GFP_KERNEL which allows reclaim/compaction in the first place).
> 
> I would ignore order here. Part of what makes this interface
> unintuitive is when we expect different flags to be passed for
> different orders, especially because the orders are often
> variable. Michal's __GFP_RETRY_HARD is an improvement in the sense
> that it ignores the order and tries to do the right thing regardless
> of it. The interface should really be about the intent at the
> callsite, not about implementation details of the allocator.
>
> But adding TRY_HARD to express "this can fail" isn't intuitive.

I am all for a better name but everything else I could come up with was
just more confusing. Take __GFP_MAYFAIL as an example. How would it be
any less confusing? Aren't all the requests which do not have
__GFP_NOFAIL automatically MAYFAIL? RETRY_HARD was an attempt to say:
retry as hard as you find reasonable but fail eventually, which should
fit quite nicely between NORETRY and NOFAIL.

> > __GFP_NORETRY - that one is well understood hopefully, and implicitly mayfail
> 
> Yeah. Never OOM, never retry etc. The callsite can fall back, and
> prefers that over OOM kills and disruptive allocation latencies.
> 
> > __GFP_MAYFAIL - ???
> 
> May OOM for certain orders and retry a few times, but still fail. The
> callsite can fall back, but it wouldn't come for free. E.g. it might
> have to abort an explicitly requested user operation.
> 
> This is the default for costly orders, so it has an effect only on
> non-costly orders. But that's where I would separate interface from
> implementation: you'd use it e.g. in callsites where you have variable
> orders but always the same fallback. XFS does that extensively.
> 
> > __GFP_MAYFAIL | __GFP_REPEAT - ???
> > 
> > Which one of the last two tries harder? How specifically? Will they differ by
> > (not) allowing OOM? Won't that be just extra confusing?
> 
> Adding __GFP_REPEAT would always be additive. This combination would
> mean: try the hardest not to fail, but don't lock up in cases when the
> order happens to be !costly.
> 
> Again, I'm not too thrilled about that flag as it's so damn vague. But
> that's more about how we communicate latency/success expectations. My
> concern is exclusively about its implication of MAYFAIL.

Our gfp flags space is quite full and adding a new flag while we keep
one with a vague meaning doesn't sound good to me. So I really
think we should just ditch __GFP_REPEAT. Whether __GFP_MAYFAIL is a
better name for the new flag I dunno. It feels confusing to me but if
that is the general agreement I don't have a big problem with that.

> > > For a generic allocation site like this, __GFP_MAYFAIL | __GFP_REPEAT
> > > does the right thing for all orders, and it's self-explanatory: try
> > > hard, allow falling back.
> > > 
> > > Whether we want a __GFP_REPEAT or __GFP_TRY_HARD at all is a different
> > > topic. In the long term, it might be better to provide best-effort per
> > > default and simply annotate MAYFAIL/NORETRY callsites that want to
> > > give up earlier. Because as I mentioned at LSFMM, it's much easier to
> > > identify callsites that have a convenient fallback than callsites that
> > > need to "try harder." Everybody thinks their allocations are oh so
> > > important. The former is much more specific and uses obvious criteria.
> > 
> > For higher-order allocations, best-effort might also mean significant system
> > disruption, not just latency of the allocation itself. One example is hugetlbfs
> > allocations (echo X > .../nr_hugepages) where the admin is willing to pay this
> > cost. But to do that by default and rely on everyone else passing NORETRY
> > wouldn't go far. So I think the TRY_HARD kind of flag makes sense.
> 
> I think whether the best-effort behavior should be opt-in or opt-out,
> or how fine-grained the latency/success control over the allocator
> should be is a different topic. I'd prefer defaulting to reliability
> and annotating low-latency requirements, but I can see TRY_HARD work
> too. It just shouldn't imply MAY_FAIL.

It is always hard to change the default behavior without breaking
anything. Up to now we had opt-in and as you can see there are not that
many users who really wanted to have higher reliability. I guess this is
because they just do not care and didn't see too many failures. The
opt-out also has the disadvantage that we would need to provide a flag
to tell the allocator to try less hard, and all we have is NORETRY and
that is way too easy. So to me it sounds like the opt-in fits better
with the current usage.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] xfs: map KM_MAYFAIL to __GFP_RETRY_HARD
  2016-06-20  8:08               ` Michal Hocko
@ 2016-06-21  4:22                 ` Johannes Weiner
  2016-06-21  9:29                   ` Vlastimil Babka
  2016-06-21 17:00                   ` Michal Hocko
  0 siblings, 2 replies; 22+ messages in thread
From: Johannes Weiner @ 2016-06-21  4:22 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vlastimil Babka, Dave Chinner, linux-mm, Andrew Morton,
	Mel Gorman, Rik van Riel, LKML

On Mon, Jun 20, 2016 at 10:08:56AM +0200, Michal Hocko wrote:
> On Fri 17-06-16 17:39:31, Johannes Weiner wrote:
> > On Fri, Jun 17, 2016 at 10:30:06PM +0200, Vlastimil Babka wrote:
> > > On 17.6.2016 20:22, Johannes Weiner wrote:
> [...]
> > > > - it allows !costly orders to fail
> > > > 
> > > > While 1. is obvious from the name, 2. is not. Even if we don't want
> > > > full-on fine-grained naming for every reclaim methodology and retry
> > > > behavior, those two things just shouldn't be tied together.
> > > 
> > > Well, if allocation is not allowed to fail, it's like trying "indefinitely hard"
> > > already. Telling it it should "try hard" then doesn't make any sense without
> > > also being able to fail.
> > 
> > I can see that argument, but it's really anything but obvious at the
> > callsite. Dave's response to Michal's patch was a good demonstration.
> > And I don't think adding comments fixes an unintuitive interface.
> 
> Yeah, I am aware of that. And it is unfortunate but a side effect of our
> !costly vs. costly difference in the default behavior. What I wanted
> to achieve was to have overrides for the default behavior (whatever it
> is). We already have two such flags and having something semantically in
> the middle sounds like a consistent way to me.

The "whatever it is" is the problem I'm having. It's one flag that
does two entirely orthogonal things, and it's quite reasonable for
somebody to want to change one behavior without the other.

"I can handle allocation failures, even when they are !costly"

has really nothing to do with

"I am ready to pay high a allocation latency to make costly
allocations succeed."

So while I understand that you want an effort-flag leveled somewhere
between NORETRY and NOFAIL, this looks more like a theoretical thing
than what existing callsites actually would want to use.

> > I think whether the best-effort behavior should be opt-in or opt-out,
> > or how fine-grained the latency/success control over the allocator
> > should be is a different topic. I'd prefer defaulting to reliability
> > and annotating low-latency requirements, but I can see TRY_HARD work
> > too. It just shouldn't imply MAY_FAIL.
> 
> It is always hard to change the default behavior without breaking
> anything. Up to now we had opt-in and as you can see there are not that
> many users who really wanted to have higher reliability. I guess this is
> because they just do not care and didn't see too many failures. The
> opt-out also has the disadvantage that we would need to provide a flag
> to tell the allocator to try less hard, and all we have is NORETRY and
> that is way too easy. So to me it sounds like the opt-in fits better
> with the current usage.

For costly allocations, the presence of __GFP_NORETRY is exactly the
same as the absence of __GFP_REPEAT. So if we made __GFP_REPEAT the
default (and deleted the flag), the opt-outs would use __GFP_NORETRY
to restore their original behavior.

As for changing the default - remember that we currently warn about
allocation failures as if they were bugs, unless they are explicitly
allocated with the __GFP_NOWARN flag. We can assume that the current
__GFP_NOWARN sites are 1) commonly failing but 2) prefer to fall back
rather than incurring latency (otherwise they would have added the
__GFP_REPEAT flag). These sites would be a good list of candidates to
annotate with __GFP_NORETRY. If we then made __GFP_REPEAT the default,
the sites that would then try harder are the same sites that would now
emit page allocation failure warnings. These are rare, and the only
times I have seen them is under enough load that latency is shot to
hell anyway. So I'm not really convinced by the regression argument.

But that would *actually* clean up the flags, not make them even more
confusing:

Allocations that can't ever handle failure would use __GFP_NOFAIL.

Callers like XFS would use __GFP_MAYFAIL specifically to disable the
implicit __GFP_NOFAIL of !costly allocations.

Callers that would prefer falling back over killing and looping would
use __GFP_NORETRY.

Wouldn't that cover all usecases and be much more intuitive, both in
the default behavior as well as in the names of the flags?
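
Under that (hypothetical) scheme the annotations might read, for example
(__GFP_MAYFAIL is only proposed in this thread, it does not exist yet):

	/* Cannot handle failure at all: loop in the allocator. */
	ptr = kmalloc(size, GFP_KERNEL | __GFP_NOFAIL);

	/* XFS KM_MAYFAIL style: opt out of the implicit no-fail
	 * behavior of !costly orders while still trying hard. */
	ptr = kmalloc(size, GFP_KERNEL | __GFP_MAYFAIL);

	/* Cheap fallback available: never OOM kill, never retry. */
	ptr = kmalloc(size, GFP_KERNEL | __GFP_NORETRY);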

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] xfs: map KM_MAYFAIL to __GFP_RETRY_HARD
  2016-06-21  4:22                 ` Johannes Weiner
@ 2016-06-21  9:29                   ` Vlastimil Babka
  2016-06-21 17:00                   ` Michal Hocko
  1 sibling, 0 replies; 22+ messages in thread
From: Vlastimil Babka @ 2016-06-21  9:29 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko
  Cc: Dave Chinner, linux-mm, Andrew Morton, Mel Gorman, Rik van Riel, LKML

On 06/21/2016 06:22 AM, Johannes Weiner wrote:
>>> I think whether the best-effort behavior should be opt-in or opt-out,
>>> or how fine-grained the latency/success control over the allocator
>>> should be is a different topic. I'd prefer defaulting to reliability
>>> and annotating low-latency requirements, but I can see TRY_HARD work
>>> too. It just shouldn't imply MAY_FAIL.
>>
>> It is always hard to change the default behavior without breaking
>> anything. Up to now we had opt-in and as you can see there are not that
>> many users who really wanted to have higher reliability. I guess this is
>> because they just do not care and didn't see too many failures. The
>> opt-out has also a disadvantage that we would need to provide a flag
>> to tell to try less hard and all we have is NORETRY and that is way too
>> easy. So to me it sounds like the opt-in fits better with the current
>> usage.
>
> For costly allocations, the presence of __GFP_NORETRY is exactly the
> same as the absence of __GFP_REPEAT. So if we made __GFP_REPEAT the
> default (and deleted the flag), the opt-outs would use __GFP_NORETRY
> to restore their original behavior.

Just FYI, this argument complicates my plan to get rid of the hacky
checks for GFP_TRANSHUGE and PF_KTHREAD (patches 05 and 06 in [1]),
where I observed the mentioned lack of difference between __GFP_NORETRY
presence and __GFP_REPEAT absence, and made use of it. Without
__GFP_REPEAT I'd have two options for khugepaged and
madvise(MADV_HUGEPAGE) allocations. Either pass __GFP_NORETRY and make
them fail more, or don't, and then they become much more disruptive (if
the default becomes best-effort, i.e. what __GFP_REPEAT used to do).

[1] http://thread.gmane.org/gmane.linux.kernel.mm/152313

> As for changing the default - remember that we currently warn about
> allocation failures as if they were bugs, unless they are explicitly
> allocated with the __GFP_NOWARN flag. We can assume that the current
> __GFP_NOWARN sites are 1) commonly failing but 2) prefer to fall back
> rather than incurring latency (otherwise they would have added the
> __GFP_REPEAT flag). These sites would be a good list of candidates to
> annotate with __GFP_NORETRY. If we then made __GFP_REPEAT the default,
> the sites that would then try harder are the same sites that would now
> emit page allocation failure warnings. These are rare, and the only
> times I have seen them is under enough load that latency is shot to
> hell anyway. So I'm not really convinced by the regression argument.
>
> But that would *actually* clean up the flags, not make them even more
> confusing:
>
> Allocations that can't ever handle failure would use __GFP_NOFAIL.
>
> Callers like XFS would use __GFP_MAYFAIL specifically to disable the
> implicit __GFP_NOFAIL of !costly allocations.
>
> Callers that would prefer falling back over killing and looping would
> use __GFP_NORETRY.
>
> Wouldn't that cover all usecases and be much more intuitive, both in
> the default behavior as well as in the names of the flags?
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] xfs: map KM_MAYFAIL to __GFP_RETRY_HARD
  2016-06-21  4:22                 ` Johannes Weiner
  2016-06-21  9:29                   ` Vlastimil Babka
@ 2016-06-21 17:00                   ` Michal Hocko
  1 sibling, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2016-06-21 17:00 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Vlastimil Babka, Dave Chinner, linux-mm, Andrew Morton,
	Mel Gorman, Rik van Riel, LKML

On Tue 21-06-16 00:22:49, Johannes Weiner wrote:
[...]
> As for changing the default - remember that we currently warn about
> allocation failures as if they were bugs, unless they are explicitly
> allocated with the __GFP_NOWARN flag. We can assume that the current
> __GFP_NOWARN sites are 1) commonly failing but 2) prefer to fall back
> rather than incurring latency (otherwise they would have added the
> __GFP_REPEAT flag). These sites would be a good list of candidates to
> annotate with __GFP_NORETRY.

This sounds like a good idea at first sight but a brief git grep shows
that many of them are just trying to silence the warning for
non-sleeping allocations which are quite likely to fail. This wouldn't
be hard to filter out so we can ignore them.

Then there are things like 8be04b9374e5 ("treewide: Add __GFP_NOWARN
to k.alloc calls with v.alloc fallbacks") where the flag is added to
many places with vmalloc fallbacks. Do we want to weaken them in
favor of the vmalloc fallback in general (note that it is not clear from
the size whether they are costly or !costly)?

I have looked at some random others and they are adding the flag without
any explanation so it is not really clear what the motivation was.

To me it seems like the flag is used quite randomly. I suspect there are
many places which do not have that flag just because nobody bothered to
report the allocation failure which is hard to reproduce.

> If we then made __GFP_REPEAT the default,
> the sites that would then try harder are the same sites that would now
> emit page allocation failure warnings. These are rare, and the only
> times I have seen them is under enough load that latency is shot to
> hell anyway. So I'm not really convinced by the regression argument.

You do not need to be under heavy load to fail those allocations. It
is sufficient to have the memory fragmented, which might be just a matter
of time. I am worried that we have a hard-to-examine number of allocation
requests that might change the overall system behavior because they
might trigger more reclaim/swap, and the source of the behavior change
wouldn't be quite obvious. On the other hand we already have some places
annotated to require more effort, which is the reason I would find it
better to follow up on that.

High order allocations can be really expensive and the current behavior
with the allocation warning has the advantage that we can see the failure
mode and get a bug report with the exact trace (hopefully) without too
much background interference. Then somebody familiar with the subsystem
can judge whether that particular allocation is worth more effort or a
different fallback.

I am not saying that changing the default behavior for costly
allocations is a no-go. I just feel it is too risky and it would be
better to use an "override the default because I know what I am doing"
flag. __GFP_RETRY_HARD might be a terrible name and a better name
would cause less confusion (__GFP_RETRY_MAYFAIL?).

> But that would *actually* clean up the flags, not make them even more
> confusing:
> 
> Allocations that can't ever handle failure would use __GFP_NOFAIL.
> 
> Callers like XFS would use __GFP_MAYFAIL specifically to disable the
> implicit __GFP_NOFAIL of !costly allocations.
>
> Callers that would prefer falling back over killing and looping would
> use __GFP_NORETRY.
> 
> Wouldn't that cover all usecases and be much more intuitive, both in
> the default behavior as well as in the names of the flags?

How do we describe kcompactd vs. page fault THP allocations? We do not
want to cause a lot of reclaim for those, but we can wait for compaction
for the former while we would prefer not to for the latter.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 0/2] mm: give GFP_REPEAT a better semantic
  2016-06-06 11:32 [RFC PATCH 0/2] mm: give GFP_REPEAT a better semantic Michal Hocko
  2016-06-06 11:32 ` [RFC PATCH 1/2] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_HARD with more useful semantic Michal Hocko
  2016-06-06 11:32 ` [RFC PATCH 2/2] xfs: map KM_MAYFAIL to __GFP_RETRY_HARD Michal Hocko
@ 2016-10-06 11:14 ` Michal Hocko
  2 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2016-10-06 11:14 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Mel Gorman, Vlastimil Babka, Johannes Weiner,
	Rik van Riel, Dave Chinner, LKML

Hi,
I would like to revive this thread. It seems we haven't reached any
conclusion. The main objection so far has been that the __GFP_RETRY_HARD
may-fail semantic is not clear and that it mixes two things (failing
for !costly and repeating for costly orders) together. As for the first
part I guess we can think of a better name; __GFP_RETRY_MAYFAIL should be
clearer IMHO.

For the second objection Johannes has suggested a separate __GFP_MAYFAIL
flag. In one variant we would keep __GFP_REPEAT as a modifier to try
harder on top. My main objection was that the gfp flag space is really
scarce but we received some relief after
http://lkml.kernel.org/r/20160912114852.GI14524@dhcp22.suse.cz so maybe
we can reconsider this.

I am not convinced that a separate MAYFAIL flag would make the situation
much better semantic-wise, though: a) it adds de facto another
order-specific gfp flag, and we have seen in the past that this concept
does not really work well, and b) if we keep __GFP_REPEAT the name
doesn't really make clear to users how hard the allocator tries -
e.g. what is the difference between GFP_KERNEL|__GFP_MAYFAIL and
GFP_KERNEL|__GFP_MAYFAIL|__GFP_REPEAT?

One way to differentiate the two would be to not trigger the OOM killer
in the first case, which would make it similar to __GFP_NORETRY except we
would retry as long as there is progress and fail right before the
OOM would be declared. __GFP_REPEAT on top would invoke the OOM killer
for !costly requests and keep retrying until we hit the OOM point again
in the same path. To me it sounds quite complicated, and the OOM behavior
subtle and non-intuitive. An alternative would be to ignore __GFP_REPEAT
for !costly orders and make it costly-specific.

Another suggestion by Johannes was to make __GFP_REPEAT the default for
costly orders and drop the flag. __GFP_NORETRY could be used to opt out
and get the previous behavior. __GFP_MAYFAIL would then be clearer,
but the main problem is that many allocation requests could turn out
to be really disruptive, leading to performance issues which would not
be that easy to debug. While we only have a few __GFP_REPEAT users (~30)
currently, there are many more allocations which would become more
aggressive. So the question is how to identify those which should be
changed to NORETRY and how to tell users when to opt out in newly added
allocation sites. To me this sounds really risky and like a game of
whack-a-mole.

Does anybody see other option(s)?

I fully realize the situation is not easy and our costly vs. !costly
heritage is hard to get around. Whatever we do can end up in a mess
similar to the one we had with __GFP_REPEAT, but I believe that we should
find a way to tell that a particular request is OK to fail and override
the default allocator policy. To me __GFP_RETRY_MAYFAIL with a "retry as
long as there are reasonable chances to succeed and then fail" semantic
sounds the most coherent and attractive because it is not order-specific
from the user point of view and there is zero risk of regressions.
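
As a usage sketch under the proposed name (hypothetical until merged,
semantics as described above):

	/* Retry as long as reclaim makes progress, then return NULL
	 * instead of looping or heading for the OOM killer again. */
	ptr = kzalloc(size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
	if (!ptr)
		return -ENOMEM;	/* the caller has a sane failure path */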
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2016-10-06 11:14 UTC | newest]

Thread overview: 22+ messages
2016-06-06 11:32 [RFC PATCH 0/2] mm: give GFP_REPEAT a better semantic Michal Hocko
2016-06-06 11:32 ` [RFC PATCH 1/2] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_HARD with more useful semantic Michal Hocko
2016-06-07 12:11   ` Tetsuo Handa
2016-06-07 12:31     ` Michal Hocko
2016-06-11 14:35       ` Tetsuo Handa
2016-06-13 11:37         ` Michal Hocko
2016-06-13 14:54           ` Tetsuo Handa
2016-06-13 15:17             ` Michal Hocko
2016-06-14 11:12               ` Tetsuo Handa
2016-06-14 18:54                 ` Michal Hocko
2016-06-06 11:32 ` [RFC PATCH 2/2] xfs: map KM_MAYFAIL to __GFP_RETRY_HARD Michal Hocko
2016-06-16  0:23   ` Dave Chinner
2016-06-16  8:03     ` Michal Hocko
2016-06-16 11:26       ` Michal Hocko
2016-06-17 18:22         ` Johannes Weiner
2016-06-17 20:30           ` Vlastimil Babka
2016-06-17 21:39             ` Johannes Weiner
2016-06-20  8:08               ` Michal Hocko
2016-06-21  4:22                 ` Johannes Weiner
2016-06-21  9:29                   ` Vlastimil Babka
2016-06-21 17:00                   ` Michal Hocko
2016-10-06 11:14 ` [RFC PATCH 0/2] mm: give GFP_REPEAT a better semantic Michal Hocko
