* [RFC PATCH 0/2] mm: give GFP_REPEAT a better semantic
@ 2016-06-06 11:32 Michal Hocko
  2016-06-06 11:32 ` [RFC PATCH 1/2] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_HARD with more useful semantic Michal Hocko
  ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread

From: Michal Hocko @ 2016-06-06 11:32 UTC
To: linux-mm
Cc: Andrew Morton, Mel Gorman, Vlastimil Babka, Johannes Weiner, Rik van Riel, Dave Chinner, LKML

Hi,
this is a follow-up to the __GFP_REPEAT cleanup merged into mmotm just recently [1].

The main motivation for the change is that the current implementation of __GFP_REPEAT is not very useful. The documentation says:

 * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
 * _might_ fail. This depends upon the particular VM implementation.

It just fails to mention that this is true only for large (costly) high-order requests, which has been the case since the flag was introduced. A similar semantic would be really helpful for small orders as well, though, because we have places where a failure with a specific fallback error handling is preferred to a potential endless loop inside the page allocator.

The cleanup [1] dropped __GFP_REPEAT usage for low-order (!costly) users, so only those which might use larger orders remain. Let's rename the flag to something more descriptive and keep using it for the existing users; their semantic will not change. Then implement a failure path for low (!costly) orders which is taken once the page allocator is about to hit the OOM killer path again. That means that the first OOM killer invocation and all the retries after it haven't helped to move on. This seems like a good indication that any further progress is highly unlikely.

Xfs code already has an existing annotation for allocations which are allowed to fail and we can trivially map them to the new gfp flag because it provides exactly the semantic KM_MAYFAIL wants. I assume we will grow more users - e.g. GFP_USER sounds like it could use the flag by default - but I haven't explored this path properly yet.

I am sending this as an RFC and would like to hear back about the approach. We discussed this at LSF this year and there were different ideas about how to achieve the semantic. I have decided to go the __GFP_RETRY_HARD way because it fits nicely into the existing scheme where __GFP_NORETRY and __GFP_NOFAIL already modify the default behavior of the page allocator, with the new flag sitting between those two existing flags. Patch 1 is much more verbose about the different modes of operation of the page allocator.

Thanks
---
[1] http://lkml.kernel.org/r/1464599699-30131-1-git-send-email-mhocko@kernel.org
* [RFC PATCH 1/2] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_HARD with more useful semantic
  2016-06-06 11:32 [RFC PATCH 0/2] mm: give GFP_REPEAT a better semantic Michal Hocko
@ 2016-06-06 11:32 ` Michal Hocko
  2016-06-07 12:11   ` Tetsuo Handa
  2016-06-06 11:32 ` [RFC PATCH 2/2] xfs: map KM_MAYFAIL to __GFP_RETRY_HARD Michal Hocko
  2016-10-06 11:14 ` [RFC PATCH 0/2] mm: give GFP_REPEAT a better semantic Michal Hocko
  2 siblings, 1 reply; 22+ messages in thread

From: Michal Hocko @ 2016-06-06 11:32 UTC
To: linux-mm
Cc: Andrew Morton, Mel Gorman, Vlastimil Babka, Johannes Weiner, Rik van Riel, Dave Chinner, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

__GFP_REPEAT was designed to bring a retry-but-eventually-fail semantic to the page allocator. This has been true, but only for allocation requests larger than PAGE_ALLOC_COSTLY_ORDER; it has always been ignored for smaller sizes. This is unfortunate because there is no way to express the same semantic for those requests: they are considered too important to fail, so they might end up looping in the page allocator forever, similarly to GFP_NOFAIL requests.

Now that the whole tree has been cleaned up and accidental or misled usage of the __GFP_REPEAT flag has been removed for !costly requests, we can give the original flag a better name and, more importantly, a more useful semantic. Let's rename it to __GFP_RETRY_HARD, which tells the user that the allocator will try really hard but there is no promise of success. This works independently of the order and overrides the default allocator behavior.

Page allocator users have several levels of guarantee vs. cost options (take GFP_KERNEL as an example):

- GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_ attempt to free memory at all. The most lightweight mode, which doesn't even kick the background reclaim. Should be used carefully because it might deplete the memory and the next user might hit the more aggressive reclaim.
- GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT) - optimistic allocation without any attempt to free memory from the current context, but kswapd may be woken to reclaim memory if the zone is below the low watermark. Can be used from atomic contexts or when the request is a performance optimization and there is another fallback for the slow path.
- (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) - non-sleeping allocation with an expensive fallback, so it can access some portion of the memory reserves. Usually used from interrupt/bh context with an expensive slow-path fallback.
- GFP_KERNEL - both background and direct reclaim are allowed and the _default_ page allocator behavior is used. That means that !costly allocation requests are basically non-failing (unless the requesting task is killed by the OOM killer) while costly requests fail early rather than cause disruptive reclaim.
- GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior; all allocation requests fail early rather than cause disruptive reclaim (one round of reclaim in this implementation). The OOM killer is not invoked.
- GFP_KERNEL | __GFP_RETRY_HARD - overrides the default allocator behavior; all allocation requests try really hard, and !costly ones are allowed to invoke the OOM killer. The request fails once no further progress is expected.
- GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior; all allocation requests loop endlessly until they succeed. This can be really dangerous, especially for larger orders.

Existing users of __GFP_REPEAT are changed to __GFP_RETRY_HARD because that has been their semantic already. No new users are added. __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_HARD if there is no progress and we have already passed the OOM point.
This means that all the reclaim opportunities have been exhausted and further retrying most probably doesn't make much sense.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 Documentation/DMA-ISA-LPC.txt                | 2 +-
 arch/powerpc/include/asm/book3s/64/pgalloc.h | 2 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c          | 2 +-
 drivers/block/xen-blkfront.c                 | 2 +-
 drivers/mmc/host/wbsd.c                      | 2 +-
 drivers/s390/char/vmcp.c                     | 2 +-
 drivers/target/target_core_transport.c       | 2 +-
 drivers/vhost/net.c                          | 2 +-
 drivers/vhost/scsi.c                         | 2 +-
 drivers/vhost/vhost.c                        | 2 +-
 fs/btrfs/check-integrity.c                   | 2 +-
 fs/btrfs/raid56.c                            | 2 +-
 include/linux/gfp.h                          | 32 +++++++++++++++++++---------
 include/linux/slab.h                         | 3 ++-
 include/trace/events/mmflags.h               | 2 +-
 mm/huge_memory.c                             | 2 +-
 mm/hugetlb.c                                 | 4 ++--
 mm/internal.h                                | 2 +-
 mm/page_alloc.c                              | 19 ++++++++++++++---
 mm/sparse-vmemmap.c                          | 4 ++--
 mm/vmscan.c                                  | 8 +++----
 net/core/dev.c                               | 6 +++---
 net/core/skbuff.c                            | 2 +-
 net/sched/sch_fq.c                           | 2 +-
 tools/perf/builtin-kmem.c                    | 2 +-
 25 files changed, 69 insertions(+), 43 deletions(-)

diff --git a/Documentation/DMA-ISA-LPC.txt b/Documentation/DMA-ISA-LPC.txt
index b1a19835e907..5b594dfb1783 100644
--- a/Documentation/DMA-ISA-LPC.txt
+++ b/Documentation/DMA-ISA-LPC.txt
@@ -42,7 +42,7 @@ requirements you pass the flag GFP_DMA to kmalloc.
 
 Unfortunately the memory available for ISA DMA is scarce so unless you
 allocate the memory during boot-up it's a good idea to also pass
-__GFP_REPEAT and __GFP_NOWARN to make the allocater try a bit harder.
+__GFP_RETRY_HARD and __GFP_NOWARN to make the allocater try a bit harder.
 
 (This scarcity also means that you should allocate the buffer as
 early as possible and not release it until the driver is unloaded.)
diff --git a/arch/powerpc/include/asm/book3s/64/pgalloc.h b/arch/powerpc/include/asm/book3s/64/pgalloc.h
index d14fcf82c00c..be3b996915a9 100644
--- a/arch/powerpc/include/asm/book3s/64/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/64/pgalloc.h
@@ -56,7 +56,7 @@ static inline pgd_t *radix__pgd_alloc(struct mm_struct *mm)
 	return (pgd_t *)__get_free_page(PGALLOC_GFP);
 #else
 	struct page *page;
-	page = alloc_pages(PGALLOC_GFP | __GFP_REPEAT, 4);
+	page = alloc_pages(PGALLOC_GFP | __GFP_RETRY_HARD, 4);
 	if (!page)
 		return NULL;
 	return (pgd_t *) page_address(page);
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 05f09ae82587..204484fbda51 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -72,7 +72,7 @@ long kvmppc_alloc_hpt(struct kvm *kvm, u32 *htab_orderp)
 	/* Lastly try successively smaller sizes from the page allocator */
 	/* Only do this if userspace didn't specify a size via ioctl */
 	while (!hpt && order > PPC_MIN_HPT_ORDER && !htab_orderp) {
-		hpt = __get_free_pages(GFP_KERNEL|__GFP_ZERO|__GFP_REPEAT|
+		hpt = __get_free_pages(GFP_KERNEL|__GFP_ZERO|__GFP_RETRY_HARD|
				       __GFP_NOWARN, order - PAGE_SHIFT);
 		if (!hpt)
 			--order;
diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index ca13df854639..2633e1c32a45 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -2029,7 +2029,7 @@ static int blkif_recover(struct blkfront_info *info)
 		rinfo = &info->rinfo[r_index];
 		/* Stage 1: Make a safe copy of the shadow state. */
 		copy = kmemdup(rinfo->shadow, sizeof(rinfo->shadow),
-			       GFP_NOIO | __GFP_REPEAT | __GFP_HIGH);
+			       GFP_NOIO | __GFP_RETRY_HARD | __GFP_HIGH);
 		if (!copy)
 			return -ENOMEM;
diff --git a/drivers/mmc/host/wbsd.c b/drivers/mmc/host/wbsd.c
index c3fd16d997ca..cb71a383c4ec 100644
--- a/drivers/mmc/host/wbsd.c
+++ b/drivers/mmc/host/wbsd.c
@@ -1386,7 +1386,7 @@ static void wbsd_request_dma(struct wbsd_host *host, int dma)
 	 * order for ISA to be able to DMA to it.
 	 */
 	host->dma_buffer = kmalloc(WBSD_DMA_SIZE,
-		GFP_NOIO | GFP_DMA | __GFP_REPEAT | __GFP_NOWARN);
+		GFP_NOIO | GFP_DMA | __GFP_RETRY_HARD | __GFP_NOWARN);
 	if (!host->dma_buffer)
 		goto free;
diff --git a/drivers/s390/char/vmcp.c b/drivers/s390/char/vmcp.c
index 2a67b496a9e2..d5ecf3007ac6 100644
--- a/drivers/s390/char/vmcp.c
+++ b/drivers/s390/char/vmcp.c
@@ -98,7 +98,7 @@ vmcp_write(struct file *file, const char __user *buff, size_t count,
 	}
 	if (!session->response)
 		session->response = (char *)__get_free_pages(GFP_KERNEL
-						| __GFP_REPEAT | GFP_DMA,
+						| __GFP_RETRY_HARD | GFP_DMA,
						get_order(session->bufsize));
 	if (!session->response) {
 		mutex_unlock(&session->mutex);
diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c
index 5ab3967dda43..6102177b2c7b 100644
--- a/drivers/target/target_core_transport.c
+++ b/drivers/target/target_core_transport.c
@@ -251,7 +251,7 @@ int transport_alloc_session_tags(struct se_session *se_sess,
 	int rc;
 
 	se_sess->sess_cmd_map = kzalloc(tag_num * tag_size,
-					GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+					GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_HARD);
 	if (!se_sess->sess_cmd_map) {
 		se_sess->sess_cmd_map = vzalloc(tag_num * tag_size);
 		if (!se_sess->sess_cmd_map) {
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index f744eeb3e2b4..820a4715eb38 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -747,7 +747,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 	struct vhost_virtqueue **vqs;
 	int i;
 
-	n = kmalloc(sizeof *n, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+	n = kmalloc(sizeof *n, GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_HARD);
 	if (!n) {
 		n = vmalloc(sizeof *n);
 		if (!n)
diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 9d6320e8ff3e..7150b45e092f 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1405,7 +1405,7 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
 	struct vhost_virtqueue **vqs;
 	int r = -ENOMEM, i;
 
-	vs = kzalloc(sizeof(*vs), GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+	vs = kzalloc(sizeof(*vs), GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_HARD);
 	if (!vs) {
 		vs = vzalloc(sizeof(*vs));
 		if (!vs)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 669fef1e2bb6..a4b0f18a69ab 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -707,7 +707,7 @@ static int vhost_memory_reg_sort_cmp(const void *p1, const void *p2)
 
 static void *vhost_kvzalloc(unsigned long size)
 {
-	void *n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+	void *n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_HARD);
 
 	if (!n)
 		n = vzalloc(size);
diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index b677a6ea6001..f8658fbbfa60 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -3049,7 +3049,7 @@ int btrfsic_mount(struct btrfs_root *root,
		       root->sectorsize, PAGE_SIZE);
 		return -1;
 	}
-	state = kzalloc(sizeof(*state), GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+	state = kzalloc(sizeof(*state), GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_HARD);
 	if (!state) {
 		state = vzalloc(sizeof(*state));
 		if (!state) {
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index f8b6d411a034..db2bba5cb90a 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -218,7 +218,7 @@ int btrfs_alloc_stripe_hash_table(struct btrfs_fs_info *info)
 	 * of a failing mount.
 	 */
 	table_size = sizeof(*table) + sizeof(*h) * num_entries;
-	table = kzalloc(table_size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+	table = kzalloc(table_size, GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_HARD);
 	if (!table) {
 		table = vzalloc(table_size);
 		if (!table)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index c29e9d347bc6..9961086eac2e 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -25,7 +25,7 @@ struct vm_area_struct;
 #define ___GFP_FS		0x80u
 #define ___GFP_COLD		0x100u
 #define ___GFP_NOWARN		0x200u
-#define ___GFP_REPEAT		0x400u
+#define ___GFP_RETRY_HARD	0x400u
 #define ___GFP_NOFAIL		0x800u
 #define ___GFP_NORETRY		0x1000u
 #define ___GFP_MEMALLOC		0x2000u
@@ -132,26 +132,38 @@ struct vm_area_struct;
  *
  * __GFP_RECLAIM is shorthand to allow/forbid both direct and kswapd reclaim.
  *
- * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
- * _might_ fail. This depends upon the particular VM implementation.
+ * The default allocator behavior depends on the request size. We have a concept
+ * of so called costly allocations (with order > PAGE_ALLOC_COSTLY_ORDER).
+ * !costly allocations are too essential to fail so they are implicitly
+ * non-failing (with some exceptions like OOM victims might fail) by default while
+ * costly requests try to be not disruptive and back off even without invoking
+ * the OOM killer. The following three modifiers might be used to override some of
+ * these implicit rules
+ *
+ * __GFP_NORETRY: The VM implementation must not retry indefinitely and will
+ * return NULL when direct reclaim and memory compaction have failed to allow
+ * the allocation to succeed. The OOM killer is not called with the current
+ * implementation. This is a default mode for costly allocations.
+ *
+ * __GFP_RETRY_HARD: Try hard to allocate the memory, but the allocation attempt
+ * _might_ fail. All viable forms of memory reclaim are tried before the fail
+ * including the OOM killer for !costly allocations. This can be used to override
+ * non-failing default behavior for !costly requests as well as fortify costly
+ * requests.
  *
  * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
  * cannot handle allocation failures. New users should be evaluated carefully
  * (and the flag should be used only when there is no reasonable failure
  * policy) but it is definitely preferable to use the flag rather than
- * opencode endless loop around allocator.
- *
- * __GFP_NORETRY: The VM implementation must not retry indefinitely and will
- * return NULL when direct reclaim and memory compaction have failed to allow
- * the allocation to succeed. The OOM killer is not called with the current
- * implementation.
+ * opencode endless loop around allocator. Using this flag for costly allocations
+ * is _highly_ discouraged.
  */
 #define __GFP_IO	((__force gfp_t)___GFP_IO)
 #define __GFP_FS	((__force gfp_t)___GFP_FS)
 #define __GFP_DIRECT_RECLAIM	((__force gfp_t)___GFP_DIRECT_RECLAIM) /* Caller can reclaim */
 #define __GFP_KSWAPD_RECLAIM	((__force gfp_t)___GFP_KSWAPD_RECLAIM) /* kswapd can wake */
 #define __GFP_RECLAIM ((__force gfp_t)(___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM))
-#define __GFP_REPEAT	((__force gfp_t)___GFP_REPEAT)
+#define __GFP_RETRY_HARD	((__force gfp_t)___GFP_RETRY_HARD)
 #define __GFP_NOFAIL	((__force gfp_t)___GFP_NOFAIL)
 #define __GFP_NORETRY	((__force gfp_t)___GFP_NORETRY)
diff --git a/include/linux/slab.h b/include/linux/slab.h
index aeb3e6d00a66..cacd437fdbf4 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -457,7 +457,8 @@ static __always_inline void *kmalloc_large(size_t size, gfp_t flags)
  *
  * %__GFP_NOWARN - If allocation fails, don't issue any warnings.
  *
- * %__GFP_REPEAT - If allocation fails initially, try once more before failing.
+ * %__GFP_RETRY_HARD - Try really hard to succeed the allocation but fail
+ * eventually.
  *
  * There are other flags available as well, but these are not intended
  * for general use, and so are not documented here. For a full list of
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 43cedbf0c759..c5f488767860 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -30,7 +30,7 @@
	{(unsigned long)__GFP_FS,		"__GFP_FS"},		\
	{(unsigned long)__GFP_COLD,		"__GFP_COLD"},		\
	{(unsigned long)__GFP_NOWARN,		"__GFP_NOWARN"},	\
-	{(unsigned long)__GFP_REPEAT,		"__GFP_REPEAT"},	\
+	{(unsigned long)__GFP_RETRY_HARD,	"__GFP_RETRY_HARD"},	\
	{(unsigned long)__GFP_NOFAIL,		"__GFP_NOFAIL"},	\
	{(unsigned long)__GFP_NORETRY,		"__GFP_NORETRY"},	\
	{(unsigned long)__GFP_COMP,		"__GFP_COMP"},		\
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index acd374e200cf..69872951f653 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -378,7 +378,7 @@ static ssize_t single_flag_store(struct kobject *kobj,
 
 /*
  * Currently defrag only disables __GFP_NOWAIT for allocation. A blind
- * __GFP_REPEAT is too aggressive, it's never worth swapping tons of
+ * __GFP_RETRY_HARD is too aggressive, it's never worth swapping tons of
  * memory just to allocate one more hugepage.
  */
 static ssize_t defrag_show(struct kobject *kobj,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e197cd7080e6..62306b5a302a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1363,7 +1363,7 @@ static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
 
	page = __alloc_pages_node(nid,
		htlb_alloc_mask(h)|__GFP_COMP|__GFP_THISNODE|
-						__GFP_REPEAT|__GFP_NOWARN,
+						__GFP_RETRY_HARD|__GFP_NOWARN,
		huge_page_order(h));
 	if (page) {
 		prep_new_huge_page(h, page, nid);
@@ -1480,7 +1480,7 @@ static struct page *__hugetlb_alloc_buddy_huge_page(struct hstate *h,
		struct vm_area_struct *vma, unsigned long addr, int nid)
 {
 	int order = huge_page_order(h);
-	gfp_t gfp = htlb_alloc_mask(h)|__GFP_COMP|__GFP_REPEAT|__GFP_NOWARN;
+	gfp_t gfp = htlb_alloc_mask(h)|__GFP_COMP|__GFP_RETRY_HARD|__GFP_NOWARN;
 	unsigned int cpuset_mems_cookie;
 
 	/*
diff --git a/mm/internal.h b/mm/internal.h
index 420bbe300bcd..083c87c539b6 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -23,7 +23,7 @@
  * hints such as HIGHMEM usage.
  */
 #define GFP_RECLAIM_MASK (__GFP_RECLAIM|__GFP_HIGH|__GFP_IO|__GFP_FS|\
-			__GFP_NOWARN|__GFP_REPEAT|__GFP_NOFAIL|\
+			__GFP_NOWARN|__GFP_RETRY_HARD|__GFP_NOFAIL|\
			__GFP_NORETRY|__GFP_MEMALLOC|__GFP_NOMEMALLOC)
 
 /* The GFP flags allowed during early boot */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 180f5afc5a1f..faa3d4a27850 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3262,7 +3262,7 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
		return compaction_zonelist_suitable(ac, order, alloc_flags);
 
 	/*
-	 * !costly requests are much more important than __GFP_REPEAT
+	 * !costly requests are much more important than __GFP_RETRY_HARD
	 * costly ones because they are de facto nofail and invoke OOM
	 * killer to move on while costly can fail and users are ready
	 * to cope with that. 1/4 retries is rather arbitrary but we
@@ -3550,6 +3550,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	enum compact_result compact_result;
 	int compaction_retries = 0;
 	int no_progress_loops = 0;
+	bool passed_oom = false;
 
 	/*
	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -3680,9 +3681,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 
 	/*
	 * Do not retry costly high order allocations unless they are
-	 * __GFP_REPEAT
+	 * __GFP_RETRY_HARD
	 */
-	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
+	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_RETRY_HARD))
		goto noretry;
 
 	/*
@@ -3711,6 +3712,17 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
				compaction_retries))
		goto retry;
 
+	/*
+	 * We have already exhausted all our reclaim opportunities including
+	 * the OOM killer without any success so it is time to admit defeat.
+	 * We do not care about the order because we want all orders to behave
+	 * consistently including !costly ones. costly are handled in
+	 * __alloc_pages_may_oom and will bail out even before the first OOM
+	 * killer invocation
+	 */
+	if (passed_oom && (gfp_mask & __GFP_RETRY_HARD))
+		goto nopage;
+
 	/* Reclaim has failed us, start killing things */
 	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
 	if (page)
@@ -3719,6 +3731,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	/* Retry as long as the OOM killer is making progress */
 	if (did_some_progress) {
 		no_progress_loops = 0;
+		passed_oom = true;
 		goto retry;
 	}
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 68885dcbaf40..261facd8e1c8 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -56,11 +56,11 @@ void * __meminit vmemmap_alloc_block(unsigned long size, int node)
 
	if (node_state(node, N_HIGH_MEMORY))
		page = alloc_pages_node(
-			node, GFP_KERNEL | __GFP_ZERO | __GFP_REPEAT,
+			node, GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_HARD,
			get_order(size));
	else
		page = alloc_pages(
-			GFP_KERNEL | __GFP_ZERO | __GFP_REPEAT,
+			GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_HARD,
			get_order(size));
	if (page)
		return page_address(page);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 93ba33789ac6..ff21efe06430 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2319,18 +2319,18 @@ static inline bool should_continue_reclaim(struct zone *zone,
		return false;
 
 	/* Consider stopping depending on scan and reclaim activity */
-	if (sc->gfp_mask & __GFP_REPEAT) {
+	if (sc->gfp_mask & __GFP_RETRY_HARD) {
		/*
-		 * For __GFP_REPEAT allocations, stop reclaiming if the
+		 * For __GFP_RETRY_HARD allocations, stop reclaiming if the
		 * full LRU list has been scanned and we are still failing
		 * to reclaim pages. This full LRU scan is potentially
-		 * expensive but a __GFP_REPEAT caller really wants to succeed
+		 * expensive but a __GFP_RETRY_HARD caller really wants to succeed
		 */
		if (!nr_reclaimed && !nr_scanned)
			return false;
	} else {
		/*
-		 * For non-__GFP_REPEAT allocations which can presumably
+		 * For non-__GFP_RETRY_HARD allocations which can presumably
		 * fail without consequence, stop if we failed to reclaim
		 * any pages from the last SWAP_CLUSTER_MAX number of
		 * pages that were scanned. This will return to the
diff --git a/net/core/dev.c b/net/core/dev.c
index 904ff431d570..8a916dd1d833 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6897,7 +6897,7 @@ static int netif_alloc_rx_queues(struct net_device *dev)
 
	BUG_ON(count < 1);
 
-	rx = kzalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+	rx = kzalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_HARD);
	if (!rx) {
		rx = vzalloc(sz);
		if (!rx)
@@ -6939,7 +6939,7 @@ static int netif_alloc_netdev_queues(struct net_device *dev)
 
	if (count < 1 || count > 0xffff)
		return -EINVAL;
 
-	tx = kzalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+	tx = kzalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_HARD);
	if (!tx) {
		tx = vzalloc(sz);
		if (!tx)
@@ -7477,7 +7477,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
	/* ensure 32-byte alignment of whole construct */
	alloc_size += NETDEV_ALIGN - 1;
 
-	p = kzalloc(alloc_size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+	p = kzalloc(alloc_size, GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_HARD);
	if (!p)
		p = vzalloc(alloc_size);
	if (!p)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index e7ec6d3ad5f0..c4c667a7b39b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4621,7 +4621,7 @@ struct sk_buff *alloc_skb_with_frags(unsigned long header_len,
 
	gfp_head = gfp_mask;
	if (gfp_head & __GFP_DIRECT_RECLAIM)
-		gfp_head |= __GFP_REPEAT;
+		gfp_head |= __GFP_RETRY_HARD;
 
	*errcode = -ENOBUFS;
	skb = alloc_skb(header_len, gfp_head);
diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index 3c6a47d66a04..7ed2224df628 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -599,7 +599,7 @@ static void *fq_alloc_node(size_t sz, int node)
 {
	void *ptr;
 
-	ptr = kmalloc_node(sz, GFP_KERNEL | __GFP_REPEAT | __GFP_NOWARN, node);
+	ptr = kmalloc_node(sz, GFP_KERNEL | __GFP_RETRY_HARD | __GFP_NOWARN, node);
	if (!ptr)
		ptr = vmalloc_node(sz, node);
	return ptr;
diff --git a/tools/perf/builtin-kmem.c b/tools/perf/builtin-kmem.c
index 58adfee230de..a63f8f4e6a6a 100644
--- a/tools/perf/builtin-kmem.c
+++ b/tools/perf/builtin-kmem.c
@@ -627,7 +627,7 @@ static const struct {
	{ "__GFP_FS",		"F" },
	{ "__GFP_COLD",		"CO" },
	{ "__GFP_NOWARN",	"NWR" },
-	{ "__GFP_REPEAT",	"R" },
+	{ "__GFP_RETRY_HARD",	"R" },
	{ "__GFP_NOFAIL",	"NF" },
	{ "__GFP_NORETRY",	"NR" },
	{ "__GFP_COMP",		"C" },
-- 
2.8.1
* Re: [RFC PATCH 1/2] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_HARD with more useful semantic
  2016-06-06 11:32 ` [RFC PATCH 1/2] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_HARD with more useful semantic Michal Hocko
@ 2016-06-07 12:11   ` Tetsuo Handa
  2016-06-07 12:31     ` Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread

From: Tetsuo Handa @ 2016-06-07 12:11 UTC
To: Michal Hocko, linux-mm
Cc: Andrew Morton, Mel Gorman, Vlastimil Babka, Johannes Weiner, Rik van Riel, Dave Chinner, LKML, Michal Hocko

On 2016/06/06 20:32, Michal Hocko wrote:
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 669fef1e2bb6..a4b0f18a69ab 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -707,7 +707,7 @@ static int vhost_memory_reg_sort_cmp(const void *p1, const void *p2)
>
>  static void *vhost_kvzalloc(unsigned long size)
>  {
> -	void *n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
> +	void *n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_HARD);

Remaining __GFP_REPEAT users are not always doing costly allocations. Sometimes they pass __GFP_REPEAT because the size is given from userspace. Thus, an unconditional s/__GFP_REPEAT/__GFP_RETRY_HARD/g is not good.

What I think is more important is hearing from __GFP_REPEAT users how hard they want to retry. It is possible that they want to retry unless SIGKILL is delivered, but passing __GFP_NOFAIL is too hard, and therefore __GFP_REPEAT is used instead. It is possible that they would use __GFP_NOFAIL | __GFP_KILLABLE if __GFP_KILLABLE were available. In my module (though I'm not using __GFP_REPEAT), I want to retry unless SIGKILL is delivered.

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 180f5afc5a1f..faa3d4a27850 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3262,7 +3262,7 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
>  		return compaction_zonelist_suitable(ac, order, alloc_flags);
>
>  	/*
> -	 * !costly requests are much more important than __GFP_REPEAT
> +	 * !costly requests are much more important than __GFP_RETRY_HARD
>  	 * costly ones because they are de facto nofail and invoke OOM
>  	 * killer to move on while costly can fail and users are ready
>  	 * to cope with that. 1/4 retries is rather arbitrary but we
> @@ -3550,6 +3550,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	enum compact_result compact_result;
>  	int compaction_retries = 0;
>  	int no_progress_loops = 0;
> +	bool passed_oom = false;
>
>  	/*
>  	 * In the slowpath, we sanity check order to avoid ever trying to
> @@ -3680,9 +3681,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>
>  	/*
>  	 * Do not retry costly high order allocations unless they are
> -	 * __GFP_REPEAT
> +	 * __GFP_RETRY_HARD
>  	 */
> -	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
> +	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_RETRY_HARD))
>  		goto noretry;
>
>  	/*
> @@ -3711,6 +3712,17 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  				compaction_retries))
>  		goto retry;
>
> +	/*
> +	 * We have already exhausted all our reclaim opportunities including
> +	 * the OOM killer without any success so it is time to admit defeat.
> +	 * We do not care about the order because we want all orders to behave
> +	 * consistently including !costly ones. costly are handled in
> +	 * __alloc_pages_may_oom and will bail out even before the first OOM
> +	 * killer invocation
> +	 */
> +	if (passed_oom && (gfp_mask & __GFP_RETRY_HARD))
> +		goto nopage;
> +

If __GFP_REPEAT was passed because the size is not known at compile time, this will break the "!costly allocations will retry unless TIF_MEMDIE is set" behavior.

>  	/* Reclaim has failed us, start killing things */
>  	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
>  	if (page)
> @@ -3719,6 +3731,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	/* Retry as long as the OOM killer is making progress */
>  	if (did_some_progress) {
>  		no_progress_loops = 0;
> +		passed_oom = true;

This is too premature. did_some_progress != 0 after returning from __alloc_pages_may_oom() does not mean the OOM killer was invoked. It only means that mutex_trylock(&oom_lock) was attempted. It is possible that somebody else is on the way to call out_of_memory(). It is possible that the OOM reaper is about to start reaping memory. Giving up after 1 jiffy of sleep is too fast.

>  		goto retry;
>  	}
* Re: [RFC PATCH 1/2] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_HARD with more useful semantic 2016-06-07 12:11 ` Tetsuo Handa @ 2016-06-07 12:31 ` Michal Hocko 2016-06-11 14:35 ` Tetsuo Handa 0 siblings, 1 reply; 22+ messages in thread From: Michal Hocko @ 2016-06-07 12:31 UTC (permalink / raw) To: Tetsuo Handa Cc: linux-mm, Andrew Morton, Mel Gorman, Vlastimil Babka, Johannes Weiner, Rik van Riel, Dave Chinner, LKML On Tue 07-06-16 21:11:03, Tetsuo Handa wrote: > On 2016/06/06 20:32, Michal Hocko wrote: > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c > > index 669fef1e2bb6..a4b0f18a69ab 100644 > > --- a/drivers/vhost/vhost.c > > +++ b/drivers/vhost/vhost.c > > @@ -707,7 +707,7 @@ static int vhost_memory_reg_sort_cmp(const void *p1, const void *p2) > > > > static void *vhost_kvzalloc(unsigned long size) > > { > > - void *n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT); > > + void *n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_HARD); > > Remaining __GFP_REPEAT users are not always doing costly allocations. Yes but... > Sometimes they pass __GFP_REPEAT because the size is given from userspace. > Thus, unconditional s/__GFP_REPEAT/__GFP_RETRY_HARD/g is not good. Would that be a regression though? Strictly speaking the __GFP_REPEAT documentation was explicit to not loop for ever. So nobody should have expected nofail semantic pretty much by definition. The fact that our previous implementation was not fully conforming to the documentation is just an implementation detail. All the remaining users of __GFP_REPEAT _have_ to be prepared for the allocation failure. So what exactly is the problem with them? > What I think more important is hearing from __GFP_REPEAT users how hard they > want to retry. It is possible that they want to retry unless SIGKILL is > delivered, but passing __GFP_NOFAIL is too hard, and therefore __GFP_REPEAT > is used instead. 
It is possible that they use __GFP_NOFAIL || __GFP_KILLABLE > if __GFP_KILLABLE were available. In my module (though I'm not using > __GFP_REPEAT), I want to retry unless SIGKILL is delivered. To be honest killability for a particular allocation request sounds like a hack to me. Just consider the expected semantic. How do you handle when one path uses explicit __GFP_KILLABLE while other path (from the same syscall) is not... If anything this would have to be process context wise. [...] > > /* Reclaim has failed us, start killing things */ > > page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress); > > if (page) > > @@ -3719,6 +3731,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, > > /* Retry as long as the OOM killer is making progress */ > > if (did_some_progress) { > > no_progress_loops = 0; > > + passed_oom = true; > > This is too premature. did_some_progress != 0 after returning from > __alloc_pages_may_oom() does not mean the OOM killer was invoked. It only means > that mutex_trylock(&oom_lock) was attempted. which means that we have reached the OOM condition and _somebody_ is actually handling the OOM on our behalf. > It is possible that somebody else > is on the way to call out_of_memory(). It is possible that the OOM reaper is > about to start reaping memory. Giving up after 1 jiffie of sleep is too fast. Sure this will always be racy. But the primary point is that we have passed the OOM line and then passed through all the retries to get to the same state again. This sounds like a pretty natural boundary to tell we have tried hard enough to rather fail and let the caller handle the fallback. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 1/2] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_HARD with more useful semantic 2016-06-07 12:31 ` Michal Hocko @ 2016-06-11 14:35 ` Tetsuo Handa 2016-06-13 11:37 ` Michal Hocko 0 siblings, 1 reply; 22+ messages in thread From: Tetsuo Handa @ 2016-06-11 14:35 UTC (permalink / raw) To: mhocko; +Cc: linux-mm, akpm, mgorman, vbabka, hannes, riel, david, linux-kernel Michal Hocko wrote: > On Tue 07-06-16 21:11:03, Tetsuo Handa wrote: > > Remaining __GFP_REPEAT users are not always doing costly allocations. > > Yes but... > > > Sometimes they pass __GFP_REPEAT because the size is given from userspace. > > Thus, unconditional s/__GFP_REPEAT/__GFP_RETRY_HARD/g is not good. > > Would that be a regression though? Strictly speaking the __GFP_REPEAT > documentation was explicit to not loop for ever. So nobody should have > expected nofail semantic pretty much by definition. The fact that our > previous implementation was not fully conforming to the documentation is > just an implementation detail. All the remaining users of __GFP_REPEAT > _have_ to be prepared for the allocation failure. So what exactly is the > problem with them? A !costly allocation becomes weaker than now if __GFP_RETRY_HARD is passed. > > > /* Reclaim has failed us, start killing things */ > > > page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress); > > > if (page) > > > @@ -3719,6 +3731,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, > > > /* Retry as long as the OOM killer is making progress */ > > > if (did_some_progress) { > > > no_progress_loops = 0; > > > + passed_oom = true; > > > > This is too premature. did_some_progress != 0 after returning from > > __alloc_pages_may_oom() does not mean the OOM killer was invoked. It only means > > that mutex_trylock(&oom_lock) was attempted. > > which means that we have reached the OOM condition and _somebody_ is > actaully handling the OOM on our behalf. 
That _somebody_ might release oom_lock without invoking the OOM killer (e.g. doing !__GFP_FS allocation), which means that we have reached the OOM condition and nobody is actually handling the OOM on our behalf. __GFP_RETRY_HARD becomes as weak as __GFP_NORETRY. I think this is a regression. > > What I think more important is hearing from __GFP_REPEAT users how hard they > > want to retry. It is possible that they want to retry unless SIGKILL is > > delivered, but passing __GFP_NOFAIL is too hard, and therefore __GFP_REPEAT > > is used instead. It is possible that they use __GFP_NOFAIL || __GFP_KILLABLE > > if __GFP_KILLABLE were available. In my module (though I'm not using > > __GFP_REPEAT), I want to retry unless SIGKILL is delivered. > > To be honest killability for a particular allocation request sounds > like a hack to me. Just consider the expected semantic. How do you > handle when one path uses explicit __GFP_KILLABLE while other path (from > the same syscall) is not... If anything this would have to be process > context wise. I didn't catch your question. But making code killable should be considered good unless it complicates error handling paths. Since we are not setting TIF_MEMDIE to all OOM-killed threads, OOM-killed threads will have to loop until mutex_trylock(&oom_lock) succeeds in order to get TIF_MEMDIE by calling out_of_memory(), which is a needless delay. Many allocations from syscall context can give up upon SIGKILL. We don't need to allow OOM-killed threads to use memory reserves if that allocation is killable. Converting down_write(&mm->mmap_sem) to down_write_killable(&mm->mmap_sem) is considered good. But converting kmalloc(GFP_KERNEL) to kmalloc(GFP_KERNEL | __GFP_KILLABLE) is considered hack. Why? ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 1/2] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_HARD with more useful semantic 2016-06-11 14:35 ` Tetsuo Handa @ 2016-06-13 11:37 ` Michal Hocko 2016-06-13 14:54 ` Tetsuo Handa 0 siblings, 1 reply; 22+ messages in thread From: Michal Hocko @ 2016-06-13 11:37 UTC (permalink / raw) To: Tetsuo Handa Cc: linux-mm, akpm, mgorman, vbabka, hannes, riel, david, linux-kernel On Sat 11-06-16 23:35:49, Tetsuo Handa wrote: > Michal Hocko wrote: > > On Tue 07-06-16 21:11:03, Tetsuo Handa wrote: > > > Remaining __GFP_REPEAT users are not always doing costly allocations. > > > > Yes but... > > > > > Sometimes they pass __GFP_REPEAT because the size is given from userspace. > > > Thus, unconditional s/__GFP_REPEAT/__GFP_RETRY_HARD/g is not good. > > > > Would that be a regression though? Strictly speaking the __GFP_REPEAT > > documentation was explicit to not loop for ever. So nobody should have > > expected nofail semantic pretty much by definition. The fact that our > > previous implementation was not fully conforming to the documentation is > > just an implementation detail. All the remaining users of __GFP_REPEAT > > _have_ to be prepared for the allocation failure. So what exactly is the > > problem with them? > > A !costly allocation becomes weaker than now if __GFP_RETRY_HARD is passed. That is true. But it is not weaker than the __GFP_REPEAT actually ever promised. __GFP_REPEAT explicitly said to not retry _for_ever_. The fact that we have ignored it is sad but that is what I am trying to address here. > > > > /* Reclaim has failed us, start killing things */ > > > > page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress); > > > > if (page) > > > > @@ -3719,6 +3731,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, > > > > /* Retry as long as the OOM killer is making progress */ > > > > if (did_some_progress) { > > > > no_progress_loops = 0; > > > > + passed_oom = true; > > > > > > This is too premature. 
did_some_progress != 0 after returning from > > > __alloc_pages_may_oom() does not mean the OOM killer was invoked. It only means > > > that mutex_trylock(&oom_lock) was attempted. > > > > which means that we have reached the OOM condition and _somebody_ is > > actaully handling the OOM on our behalf. > > That _somebody_ might release oom_lock without invoking the OOM killer (e.g. > doing !__GFP_FS allocation), which means that we have reached the OOM condition > and nobody is actually handling the OOM on our behalf. __GFP_RETRY_HARD becomes > as weak as __GFP_NORETRY. I think this is a regression. I really fail to see your point. We are talking about a gfp flag which tells the allocator to retry as much as it is feasible. Getting through all the reclaim attempts two times without any progress sounds like a fair criterion. Well, we could try $NUM times but that wouldn't make too much difference to what you are writing above. The fact whether somebody has been killed or not is not really that important IMHO. > > > What I think more important is hearing from __GFP_REPEAT users how hard they > > > want to retry. It is possible that they want to retry unless SIGKILL is > > > delivered, but passing __GFP_NOFAIL is too hard, and therefore __GFP_REPEAT > > > is used instead. It is possible that they use __GFP_NOFAIL || __GFP_KILLABLE > > > if __GFP_KILLABLE were available. In my module (though I'm not using > > > __GFP_REPEAT), I want to retry unless SIGKILL is delivered. > > > > To be honest killability for a particular allocation request sounds > > like a hack to me. Just consider the expected semantic. How do you > > handle when one path uses explicit __GFP_KILLABLE while other path (from > > the same syscall) is not... If anything this would have to be process > > context wise. > > I didn't catch your question. But making code killable should be considered > good unless it complicates error handling paths. 
What I meant was this call chain: kmalloc(GFP_KILLABLE) -> func1() -> kmalloc(GFP_KERNEL). The inner allocation is still not a killable context because whatever you call (func1) might have a different view about killability. So the per allocation context will not work reliably. > Since we are not setting TIF_MEMDIE to all OOM-killed threads, OOM-killed > threads will have to loop until mutex_trylock(&oom_lock) succeeds in order > to get TIF_MEMDIE by calling out_of_memory(), which is a needless delay. > > Many allocations from syscall context can give up upon SIGKILL. We don't > need to allow OOM-killed threads to use memory reserves if that allocation > is killable. > > Converting down_write(&mm->mmap_sem) to down_write_killable(&mm->mmap_sem) > is considered good. But converting kmalloc(GFP_KERNEL) to > kmalloc(GFP_KERNEL | __GFP_KILLABLE) is considered hack. Why? Because unblocking the killable context is meant to help others who want to take the lock to make a forward progress. While the killable allocation context is only about the particular allocation to fail when the task is killed. There is no direct resource to release. So unless all the allocations in the same scope are killable this will not help anything. See my point? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 1/2] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_HARD with more useful semantic 2016-06-13 11:37 ` Michal Hocko @ 2016-06-13 14:54 ` Tetsuo Handa 2016-06-13 15:17 ` Michal Hocko 0 siblings, 1 reply; 22+ messages in thread From: Tetsuo Handa @ 2016-06-13 14:54 UTC (permalink / raw) To: mhocko; +Cc: linux-mm, akpm, mgorman, vbabka, hannes, riel, david, linux-kernel Michal Hocko wrote: > On Sat 11-06-16 23:35:49, Tetsuo Handa wrote: > > Michal Hocko wrote: > > > On Tue 07-06-16 21:11:03, Tetsuo Handa wrote: > > > > Remaining __GFP_REPEAT users are not always doing costly allocations. > > > > > > Yes but... > > > > > > > Sometimes they pass __GFP_REPEAT because the size is given from userspace. > > > > Thus, unconditional s/__GFP_REPEAT/__GFP_RETRY_HARD/g is not good. > > > > > > Would that be a regression though? Strictly speaking the __GFP_REPEAT > > > documentation was explicit to not loop for ever. So nobody should have > > > expected nofail semantic pretty much by definition. The fact that our > > > previous implementation was not fully conforming to the documentation is > > > just an implementation detail. All the remaining users of __GFP_REPEAT > > > _have_ to be prepared for the allocation failure. So what exactly is the > > > problem with them? > > > > A !costly allocation becomes weaker than now if __GFP_RETRY_HARD is passed. > > That is true. But it is not weaker than the __GFP_REPEAT actually ever > promissed. __GFP_REPEAT explicitly said to not retry _for_ever_. The > fact that we have ignored it is sad but that is what I am trying to > address here. Whatever you rename __GFP_REPEAT to, it sounds strange to me that !costly __GFP_REPEAT allocations are weaker than !costly !__GFP_REPEAT allocations. Are you planning to make !costly !__GFP_REPEAT allocations to behave like __GFP_NORETRY? 
> > > > > > /* Reclaim has failed us, start killing things */ > > > > > page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress); > > > > > if (page) > > > > > @@ -3719,6 +3731,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, > > > > > /* Retry as long as the OOM killer is making progress */ > > > > > if (did_some_progress) { > > > > > no_progress_loops = 0; > > > > > + passed_oom = true; > > > > > > > > This is too premature. did_some_progress != 0 after returning from > > > > __alloc_pages_may_oom() does not mean the OOM killer was invoked. It only means > > > > that mutex_trylock(&oom_lock) was attempted. > > > > > > which means that we have reached the OOM condition and _somebody_ is > > > actually handling the OOM on our behalf. > > > > That _somebody_ might release oom_lock without invoking the OOM killer (e.g. > > doing !__GFP_FS allocation), which means that we have reached the OOM condition > > and nobody is actually handling the OOM on our behalf. __GFP_RETRY_HARD becomes > > as weak as __GFP_NORETRY. I think this is a regression. > > I really fail to see your point. We are talking about a gfp flag which > tells the allocator to retry as much as it is feasible. Getting through > all the reclaim attempts two times without any progress sounds like a > fair criterion. Well, we could try $NUM times but that wouldn't make too > much difference to what you are writing above. The fact whether somebody > has been killed or not is not really that important IMHO. If the first full round of reclaim attempts made no progress, a second round is unlikely to make progress either unless the OOM killer kills something. Thus, going through all the reclaim attempts twice without any progress and without killing somebody sounds almost equivalent to doing a single round of reclaim attempts. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 1/2] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_HARD with more useful semantic 2016-06-13 14:54 ` Tetsuo Handa @ 2016-06-13 15:17 ` Michal Hocko 2016-06-14 11:12 ` Tetsuo Handa 0 siblings, 1 reply; 22+ messages in thread From: Michal Hocko @ 2016-06-13 15:17 UTC (permalink / raw) To: Tetsuo Handa Cc: linux-mm, akpm, mgorman, vbabka, hannes, riel, david, linux-kernel On Mon 13-06-16 23:54:13, Tetsuo Handa wrote: > Michal Hocko wrote: > > On Sat 11-06-16 23:35:49, Tetsuo Handa wrote: > > > Michal Hocko wrote: > > > > On Tue 07-06-16 21:11:03, Tetsuo Handa wrote: > > > > > Remaining __GFP_REPEAT users are not always doing costly allocations. > > > > > > > > Yes but... > > > > > > > > > Sometimes they pass __GFP_REPEAT because the size is given from userspace. > > > > > Thus, unconditional s/__GFP_REPEAT/__GFP_RETRY_HARD/g is not good. > > > > > > > > Would that be a regression though? Strictly speaking the __GFP_REPEAT > > > > documentation was explicit to not loop for ever. So nobody should have > > > > expected nofail semantic pretty much by definition. The fact that our > > > > previous implementation was not fully conforming to the documentation is > > > > just an implementation detail. All the remaining users of __GFP_REPEAT > > > > _have_ to be prepared for the allocation failure. So what exactly is the > > > > problem with them? > > > > > > A !costly allocation becomes weaker than now if __GFP_RETRY_HARD is passed. > > > > That is true. But it is not weaker than the __GFP_REPEAT actually ever > > promissed. __GFP_REPEAT explicitly said to not retry _for_ever_. The > > fact that we have ignored it is sad but that is what I am trying to > > address here. > > Whatever you rename __GFP_REPEAT to, it sounds strange to me that !costly > __GFP_REPEAT allocations are weaker than !costly !__GFP_REPEAT allocations. > Are you planning to make !costly !__GFP_REPEAT allocations to behave like > __GFP_NORETRY? 
The patch description tries to explain the difference: __GFP_NORETRY doesn't retry at all, __GFP_RETRY_HARD retries as hard as feasible, and __GFP_NOFAIL retries for ever - all of them regardless of the order. This is the way to tell the allocator to change its default behavior which might be, and actually is, different depending on the order. [...] > > > That _somebody_ might release oom_lock without invoking the OOM killer (e.g. > > > doing !__GFP_FS allocation), which means that we have reached the OOM condition > > > and nobody is actually handling the OOM on our behalf. __GFP_RETRY_HARD becomes > > > as weak as __GFP_NORETRY. I think this is a regression. > > > > I really fail to see your point. We are talking about a gfp flag which > > tells the allocator to retry as much as it is feasible. Getting through > > all the reclaim attempts two times without any progress sounds like a > > fair criterion. Well, we could try $NUM times but that wouldn't make too > > much difference to what you are writing above. The fact whether somebody > > has been killed or not is not really that important IMHO. > > If all the reclaim attempt first time made no progress, all the reclaim > attempt second time unlikely make progress unless the OOM killer kills > something. Thus, doing all the reclaim attempts two times without any progress > without killing somebody sounds almost equivalent to doing all the reclaim > attempt only once. Yes, that is possible. You might have a GFP_NOFS only load where nothing really invokes the OOM killer. Does that actually matter, though? The semantic of the flag is to retry hard while the page allocator believes it can make a forward progress. But not for ever. We never know whether a progress is possible at all. We have certain heuristics when to give up, try to invoke OOM killer and try again hoping things have changed. This is not much different except that we declare there is no hope once we get to the OOM point again without being able to succeed. 
Are you suggesting a more precise heuristic? Or do you claim that we do not need a flag which would put a middle ground between __GFP_NORETRY and __GFP_NOFAIL which are on the extreme sides? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 1/2] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_HARD with more useful semantic 2016-06-13 15:17 ` Michal Hocko @ 2016-06-14 11:12 ` Tetsuo Handa 2016-06-14 18:54 ` Michal Hocko 0 siblings, 1 reply; 22+ messages in thread From: Tetsuo Handa @ 2016-06-14 11:12 UTC (permalink / raw) To: mhocko; +Cc: linux-mm, akpm, mgorman, vbabka, hannes, riel, david, linux-kernel Michal Hocko wrote: > > > > That _somebody_ might release oom_lock without invoking the OOM killer (e.g. > > > > doing !__GFP_FS allocation), which means that we have reached the OOM condition > > > > and nobody is actually handling the OOM on our behalf. __GFP_RETRY_HARD becomes > > > > as weak as __GFP_NORETRY. I think this is a regression. > > > > > > I really fail to see your point. We are talking about a gfp flag which > > > tells the allocator to retry as much as it is feasible. Getting through > > > all the reclaim attempts two times without any progress sounds like a > > > fair criterion. Well, we could try $NUM times but that wouldn't make too > > > much difference to what you are writing above. The fact whether somebody > > > has been killed or not is not really that important IMHO. > > > > If all the reclaim attempt first time made no progress, all the reclaim > > attempt second time unlikely make progress unless the OOM killer kills > > something. Thus, doing all the reclaim attempts two times without any progress > > without killing somebody sounds almost equivalent to doing all the reclaim > > attempt only once. > > Yes, that is possible. You might have a GFP_NOFS only load where nothing > really invokes the OOM killer. Does that actually matter, though? The > semantic of the flag is to retry hard while the page allocator believes > it can make a forward progress. But not for ever. We never know whether > a progress is possible at all. We have certain heuristics when to give > up, try to invoke OOM killer and try again hoping things have changed. 
> This is not much different except we declare that no hope to getting to > the OOM point again without being able to succeed. Are you suggesting > a more precise heuristic? Or do you claim that we do not need a flag > which would put a middle ground between __GFP_NORETRY and __GFP_NOFAIL > which are on the extreme sides? Well, maybe we can get rid of __GFP_RETRY (or make __GFP_RETRY used for only huge pages). Many __GFP_RETRY users are ready to fall back to vmalloc(). We are not sure whether such __GFP_RETRY users want to retry with OOM-killing somebody (we don't have __GFP_MAY_OOM_KILL which explicitly asks for "retry with OOM-killing somebody"). If __GFP_RETRY means nothing but try once more, void *n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY); if (!n) n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY); will emulate it. ----- arch/powerpc/include/asm/book3s/64/pgalloc.h ----- static inline pgd_t *radix__pgd_alloc(struct mm_struct *mm) { #ifdef CONFIG_PPC_64K_PAGES return (pgd_t *)__get_free_page(PGALLOC_GFP); #else struct page *page; page = alloc_pages(GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO | __GFP_REPEAT, 4); if (!page) return NULL; return (pgd_t *) page_address(page); #endif } ----- arch/powerpc/kvm/book3s_64_mmu_hv.c ----- kvm->arch.hpt_cma_alloc = 0; page = kvm_alloc_hpt(1ul << (order - PAGE_SHIFT)); if (page) { hpt = (unsigned long)pfn_to_kaddr(page_to_pfn(page)); memset((void *)hpt, 0, (1ul << order)); kvm->arch.hpt_cma_alloc = 1; } /* Lastly try successively smaller sizes from the page allocator */ /* Only do this if userspace didn't specify a size via ioctl */ while (!hpt && order > 18 && !htab_orderp) { hpt = __get_free_pages(GFP_KERNEL|__GFP_ZERO|__GFP_REPEAT| __GFP_NOWARN, order - PAGE_SHIFT); if (!hpt) --order; } if (!hpt) return -ENOMEM; ----- drivers/vhost/vhost.c ----- static void *vhost_kvzalloc(unsigned long size) { void *n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT); if (!n) n = vzalloc(size); return n; } 
----- drivers/vhost/scsi.c ----- vs = kzalloc(sizeof(*vs), GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT); if (!vs) { vs = vzalloc(sizeof(*vs)); if (!vs) goto err_vs; } ----- drivers/vhost/net.c ----- n = kmalloc(sizeof *n, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT); if (!n) { n = vmalloc(sizeof *n); if (!n) return -ENOMEM; } ----- drivers/block/xen-blkfront.c ----- /* Stage 1: Make a safe copy of the shadow state. */ copy = kmemdup(rinfo->shadow, sizeof(rinfo->shadow), GFP_NOIO | __GFP_REPEAT | __GFP_HIGH); if (!copy) return -ENOMEM; ----- drivers/mmc/host/wbsd.c ----- /* * We need to allocate a special buffer in * order for ISA to be able to DMA to it. */ host->dma_buffer = kmalloc(65536, GFP_NOIO | GFP_DMA | __GFP_REPEAT | __GFP_NOWARN); if (!host->dma_buffer) goto free; ----- drivers/target/target_core_transport.c ----- se_sess->sess_cmd_map = kzalloc(tag_num * tag_size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT); if (!se_sess->sess_cmd_map) { se_sess->sess_cmd_map = vzalloc(tag_num * tag_size); if (!se_sess->sess_cmd_map) { pr_err("Unable to allocate se_sess->sess_cmd_map\n"); return -ENOMEM; } } ----- drivers/s390/char/vmcp.c ----- if (mutex_lock_interruptible(&session->mutex)) { kfree(cmd); return -ERESTARTSYS; } if (!session->response) session->response = (char *)__get_free_pages(GFP_KERNEL | __GFP_REPEAT | GFP_DMA, get_order(session->bufsize)); if (!session->response) { mutex_unlock(&session->mutex); kfree(cmd); return -ENOMEM; } ----- fs/btrfs/raid56.c ----- table_size = sizeof(*table) + sizeof(*h) * num_entries; table = kzalloc(table_size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT); if (!table) { table = vzalloc(table_size); if (!table) return -ENOMEM; } ----- fs/btrfs/check-integrity.c ----- state = kzalloc(sizeof(*state), GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT); if (!state) { state = vzalloc(sizeof(*state)); if (!state) { printk(KERN_INFO "btrfs check-integrity: vzalloc() failed!\n"); return -1; } } ----- mm/sparse-vmemmap.c ----- void * __meminit 
vmemmap_alloc_block(unsigned long size, int node) { /* If the main allocator is up use that, fallback to bootmem. */ if (slab_is_available()) { struct page *page; if (node_state(node, N_HIGH_MEMORY)) page = alloc_pages_node( node, GFP_KERNEL | __GFP_ZERO | __GFP_REPEAT, get_order(size)); else page = alloc_pages( GFP_KERNEL | __GFP_ZERO | __GFP_REPEAT, get_order(size)); if (page) return page_address(page); return NULL; } else return __earlyonly_bootmem_alloc(node, size, size, __pa(MAX_DMA_ADDRESS)); } ----- mm/hugetlb.c ----- static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid) { struct page *page; page = __alloc_pages_node(nid, htlb_alloc_mask(h)|__GFP_COMP|__GFP_THISNODE| __GFP_REPEAT|__GFP_NOWARN, huge_page_order(h)); if (page) { prep_new_huge_page(h, page, nid); } return page; } static struct page *__hugetlb_alloc_buddy_huge_page(struct hstate *h, struct vm_area_struct *vma, unsigned long addr, int nid) { int order = huge_page_order(h); gfp_t gfp = htlb_alloc_mask(h)|__GFP_COMP|__GFP_REPEAT|__GFP_NOWARN; unsigned int cpuset_mems_cookie; ----- net/core/skbuff.c ----- gfp_head = gfp_mask; if (gfp_head & __GFP_DIRECT_RECLAIM) gfp_head |= __GFP_REPEAT; *errcode = -ENOBUFS; skb = alloc_skb(header_len, gfp_head); if (!skb) return NULL; ----- net/core/dev.c ----- rx = kzalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT); if (!rx) { rx = vzalloc(sz); if (!rx) return -ENOMEM; } dev->_rx = rx; tx = kzalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT); if (!tx) { tx = vzalloc(sz); if (!tx) return -ENOMEM; } dev->_tx = tx; p = kzalloc(alloc_size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT); if (!p) p = vzalloc(alloc_size); if (!p) return NULL; ----- net/sched/sch_fq.c ----- static void *fq_alloc_node(size_t sz, int node) { void *ptr; ptr = kmalloc_node(sz, GFP_KERNEL | __GFP_REPEAT | __GFP_NOWARN, node); if (!ptr) ptr = vmalloc_node(sz, node); return ptr; } ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 1/2] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_HARD with more useful semantic 2016-06-14 11:12 ` Tetsuo Handa @ 2016-06-14 18:54 ` Michal Hocko 0 siblings, 0 replies; 22+ messages in thread From: Michal Hocko @ 2016-06-14 18:54 UTC (permalink / raw) To: Tetsuo Handa Cc: linux-mm, akpm, mgorman, vbabka, hannes, riel, david, linux-kernel On Tue 14-06-16 20:12:08, Tetsuo Handa wrote: > Michal Hocko wrote: > > > > > That _somebody_ might release oom_lock without invoking the OOM killer (e.g. > > > > > doing !__GFP_FS allocation), which means that we have reached the OOM condition > > > > > and nobody is actually handling the OOM on our behalf. __GFP_RETRY_HARD becomes > > > > > as weak as __GFP_NORETRY. I think this is a regression. > > > > > > > > I really fail to see your point. We are talking about a gfp flag which > > > > tells the allocator to retry as much as it is feasible. Getting through > > > > all the reclaim attempts two times without any progress sounds like a > > > > fair criterion. Well, we could try $NUM times but that wouldn't make too > > > > much difference to what you are writing above. The fact whether somebody > > > > has been killed or not is not really that important IMHO. > > > > > > If all the reclaim attempt first time made no progress, all the reclaim > > > attempt second time unlikely make progress unless the OOM killer kills > > > something. Thus, doing all the reclaim attempts two times without any progress > > > without killing somebody sounds almost equivalent to doing all the reclaim > > > attempt only once. > > > > Yes, that is possible. You might have a GFP_NOFS only load where nothing > > really invokes the OOM killer. Does that actually matter, though? The > > semantic of the flag is to retry hard while the page allocator believes > > it can make a forward progress. But not for ever. We never know whether > > a progress is possible at all. 
We have certain heuristics when to give > > up, try to invoke OOM killer and try again hoping things have changed. > > This is not much different except we declare that no hope to getting to > > the OOM point again without being able to succeed. Are you suggesting > > a more precise heuristic? Or do you claim that we do not need a flag > > which would put a middle ground between __GFP_NORETRY and __GFP_NOFAIL > > which are on the extreme sides? > > Well, maybe we can get rid of __GFP_RETRY (or make __GFP_RETRY used for only > huge pages). Many __GFP_RETRY users are ready to fall back to vmalloc(). But some of them should try hard before they fall back to vmalloc. And let me repeat, there are valid usecases when you want to tell the allocator to not retry !costly requests for ever. > We are not sure whether such __GFP_RETRY users want to retry with OOM-killing > somebody (we don't have __GFP_MAY_OOM_KILL which explicitly asks for "retry > with OOM-killing somebody"). And we do not want something like __GFP_MAY_OOM_KILL. We have __GFP_NORETRY to tell it to bail out early. We do not want callers to control the OOM behavior. That is an MM internal thing IMHO. > If __GFP_RETRY means nothing but try once more, > > void *n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY); > if (!n) > n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY); > > will emulate it. It won't. __GFP_NORETRY is way too weak because it only invokes optimistic compaction so the success rate would be really small even if you retry with the same flag multiple times in a row. We definitely need a stronger mode to tell that the allocator should really try hard before it fails. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 22+ messages in thread
* [RFC PATCH 2/2] xfs: map KM_MAYFAIL to __GFP_RETRY_HARD 2016-06-06 11:32 [RFC PATCH 0/2] mm: give GFP_REPEAT a better semantic Michal Hocko 2016-06-06 11:32 ` [RFC PATCH 1/2] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_HARD with more useful semantic Michal Hocko @ 2016-06-06 11:32 ` Michal Hocko 2016-06-16 0:23 ` Dave Chinner 2016-10-06 11:14 ` [RFC PATCH 0/2] mm: give GFP_REPEAT a better semantic Michal Hocko 2 siblings, 1 reply; 22+ messages in thread From: Michal Hocko @ 2016-06-06 11:32 UTC (permalink / raw) To: linux-mm Cc: Andrew Morton, Mel Gorman, Vlastimil Babka, Johannes Weiner, Rik van Riel, Dave Chinner, LKML, Michal Hocko From: Michal Hocko <mhocko@suse.com> KM_MAYFAIL didn't have any suitable GFP_FOO counterpart until recently so it relied on the default page allocator behavior for the given set of flags. This means that small allocations actually never failed. Now that we have the __GFP_RETRY_HARD flag which works independently of the allocation request size we can map KM_MAYFAIL to it. The allocator will try as hard as it can to fulfill the request but fails eventually if no progress can be made. Signed-off-by: Michal Hocko <mhocko@suse.com> --- fs/xfs/kmem.h | 3 +++ 1 file changed, 3 insertions(+) diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h index 689f746224e7..34e6b062ce0e 100644 --- a/fs/xfs/kmem.h +++ b/fs/xfs/kmem.h @@ -54,6 +54,9 @@ kmem_flags_convert(xfs_km_flags_t flags) lflags &= ~__GFP_FS; } + if (flags & KM_MAYFAIL) + lflags |= __GFP_RETRY_HARD; + if (flags & KM_ZERO) lflags |= __GFP_ZERO; -- 2.8.1 ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 2/2] xfs: map KM_MAYFAIL to __GFP_RETRY_HARD 2016-06-06 11:32 ` [RFC PATCH 2/2] xfs: map KM_MAYFAIL to __GFP_RETRY_HARD Michal Hocko @ 2016-06-16 0:23 ` Dave Chinner 2016-06-16 8:03 ` Michal Hocko 0 siblings, 1 reply; 22+ messages in thread From: Dave Chinner @ 2016-06-16 0:23 UTC (permalink / raw) To: Michal Hocko Cc: linux-mm, Andrew Morton, Mel Gorman, Vlastimil Babka, Johannes Weiner, Rik van Riel, LKML, Michal Hocko On Mon, Jun 06, 2016 at 01:32:16PM +0200, Michal Hocko wrote: > From: Michal Hocko <mhocko@suse.com> > > KM_MAYFAIL didn't have any suitable GFP_FOO counterpart until recently > so it relied on the default page allocator behavior for the given set > of flags. This means that small allocations actually never failed. > > Now that we have __GFP_RETRY_HARD flags which works independently on the > allocation request size we can map KM_MAYFAIL to it. The allocator will > try as hard as it can to fulfill the request but fails eventually if > the progress cannot be made. > > Signed-off-by: Michal Hocko <mhocko@suse.com> > --- > fs/xfs/kmem.h | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h > index 689f746224e7..34e6b062ce0e 100644 > --- a/fs/xfs/kmem.h > +++ b/fs/xfs/kmem.h > @@ -54,6 +54,9 @@ kmem_flags_convert(xfs_km_flags_t flags) > lflags &= ~__GFP_FS; > } > > + if (flags & KM_MAYFAIL) > + lflags |= __GFP_RETRY_HARD; > + I don't understand. KM_MAYFAIL means "caller handles allocation failure, so retry on failure is not required." To then map KM_MAYFAIL to a flag that implies the allocation will internally retry to try exceptionally hard to prevent failure seems wrong. IOWs, KM_MAYFAIL means XFS is just using for normal allocator behaviour here, so I'm not sure what problem this change is actually solving and it's not clear from the description.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 2/2] xfs: map KM_MAYFAIL to __GFP_RETRY_HARD 2016-06-16 0:23 ` Dave Chinner @ 2016-06-16 8:03 ` Michal Hocko 2016-06-16 11:26 ` Michal Hocko 0 siblings, 1 reply; 22+ messages in thread From: Michal Hocko @ 2016-06-16 8:03 UTC (permalink / raw) To: Dave Chinner Cc: linux-mm, Andrew Morton, Mel Gorman, Vlastimil Babka, Johannes Weiner, Rik van Riel, LKML On Thu 16-06-16 10:23:02, Dave Chinner wrote: > On Mon, Jun 06, 2016 at 01:32:16PM +0200, Michal Hocko wrote: > > From: Michal Hocko <mhocko@suse.com> > > > > KM_MAYFAIL didn't have any suitable GFP_FOO counterpart until recently > > so it relied on the default page allocator behavior for the given set > > of flags. This means that small allocations actually never failed. > > > > Now that we have __GFP_RETRY_HARD flags which works independently on the > > allocation request size we can map KM_MAYFAIL to it. The allocator will > > try as hard as it can to fulfill the request but fails eventually if > > the progress cannot be made. > > > > Signed-off-by: Michal Hocko <mhocko@suse.com> > > --- > > fs/xfs/kmem.h | 3 +++ > > 1 file changed, 3 insertions(+) > > > > diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h > > index 689f746224e7..34e6b062ce0e 100644 > > --- a/fs/xfs/kmem.h > > +++ b/fs/xfs/kmem.h > > @@ -54,6 +54,9 @@ kmem_flags_convert(xfs_km_flags_t flags) > > lflags &= ~__GFP_FS; > > } > > > > + if (flags & KM_MAYFAIL) > > + lflags |= __GFP_RETRY_HARD; > > + > > I don't understand. KM_MAYFAIL means "caller handles > allocation failure, so retry on failure is not required." To then > map KM_MAYFAIL to a flag that implies the allocation will internally > retry to try exceptionally hard to prevent failure seems wrong. The primary point, which I've tried to describe in the changelog, is that the default allocator behavior is to retry endlessly for small orders. You can override this by using __GFP_NORETRY which doesn't retry at all and fails quite early. 
My understanding of KM_MAYFAIL is that it can cope with allocation failures. The lack of __GFP_NORETRY made me think that the failure should be prevented as much as possible. __GFP_RETRY_HARD sits semantically somewhere in the middle between the __GFP_NORETRY and __GFP_NOFAIL semantics, independently of the allocation size. Does that make more sense now? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 2/2] xfs: map KM_MAYFAIL to __GFP_RETRY_HARD 2016-06-16 8:03 ` Michal Hocko @ 2016-06-16 11:26 ` Michal Hocko 2016-06-17 18:22 ` Johannes Weiner 0 siblings, 1 reply; 22+ messages in thread From: Michal Hocko @ 2016-06-16 11:26 UTC (permalink / raw) To: Dave Chinner Cc: linux-mm, Andrew Morton, Mel Gorman, Vlastimil Babka, Johannes Weiner, Rik van Riel, LKML On Thu 16-06-16 10:03:55, Michal Hocko wrote: > On Thu 16-06-16 10:23:02, Dave Chinner wrote: > > On Mon, Jun 06, 2016 at 01:32:16PM +0200, Michal Hocko wrote: > > > From: Michal Hocko <mhocko@suse.com> > > > > > > KM_MAYFAIL didn't have any suitable GFP_FOO counterpart until recently > > > so it relied on the default page allocator behavior for the given set > > > of flags. This means that small allocations actually never failed. > > > > > > Now that we have __GFP_RETRY_HARD flags which works independently on the > > > allocation request size we can map KM_MAYFAIL to it. The allocator will > > > try as hard as it can to fulfill the request but fails eventually if > > > the progress cannot be made. > > > > > > Signed-off-by: Michal Hocko <mhocko@suse.com> > > > --- > > > fs/xfs/kmem.h | 3 +++ > > > 1 file changed, 3 insertions(+) > > > > > > diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h > > > index 689f746224e7..34e6b062ce0e 100644 > > > --- a/fs/xfs/kmem.h > > > +++ b/fs/xfs/kmem.h > > > @@ -54,6 +54,9 @@ kmem_flags_convert(xfs_km_flags_t flags) > > > lflags &= ~__GFP_FS; > > > } > > > > > > + if (flags & KM_MAYFAIL) > > > + lflags |= __GFP_RETRY_HARD; > > > + > > > > I don't understand. KM_MAYFAIL means "caller handles > > allocation failure, so retry on failure is not required." To then > > map KM_MAYFAIL to a flag that implies the allocation will internally > > retry to try exceptionally hard to prevent failure seems wrong. > > The primary point, which I've tried to describe in the changelog, is > that the default allocator behavior is to retry endlessly for small > orders. 
You can override this by using __GFP_NORETRY which doesn't retry > at all and fails quite early. My understanding of KM_MAYFAIL is that > it can cope with allocation failures. The lack of __GFP_NORETRY made me > think that the failure should be prevented as much as possible. > __GFP_RETRY_HARD is semantically somwhere in the middle between > __GFP_NORETRY and __GFP_NOFAIL semantic independently on the allocation > size. > > Does that make more sense now? I would add the following explanation into the code: diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h index 34e6b062ce0e..10708f065191 100644 --- a/fs/xfs/kmem.h +++ b/fs/xfs/kmem.h @@ -54,6 +54,13 @@ kmem_flags_convert(xfs_km_flags_t flags) lflags &= ~__GFP_FS; } + /* + * Default page/slab allocator behavior is to retry for ever + * for small allocations. We can override this behavior by using + * __GFP_RETRY_HARD which will tell the allocator to retry as long + * as it is feasible but rather fail than retry for ever for all + * request sizes. + */ if (flags & KM_MAYFAIL) lflags |= __GFP_RETRY_HARD; -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 2/2] xfs: map KM_MAYFAIL to __GFP_RETRY_HARD 2016-06-16 11:26 ` Michal Hocko @ 2016-06-17 18:22 ` Johannes Weiner 2016-06-17 20:30 ` Vlastimil Babka 0 siblings, 1 reply; 22+ messages in thread From: Johannes Weiner @ 2016-06-17 18:22 UTC (permalink / raw) To: Michal Hocko Cc: Dave Chinner, linux-mm, Andrew Morton, Mel Gorman, Vlastimil Babka, Rik van Riel, LKML On Thu, Jun 16, 2016 at 01:26:06PM +0200, Michal Hocko wrote: > @@ -54,6 +54,13 @@ kmem_flags_convert(xfs_km_flags_t flags) > lflags &= ~__GFP_FS; > } > > + /* > + * Default page/slab allocator behavior is to retry for ever > + * for small allocations. We can override this behavior by using > + * __GFP_RETRY_HARD which will tell the allocator to retry as long > + * as it is feasible but rather fail than retry for ever for all > + * request sizes. > + */ > if (flags & KM_MAYFAIL) > lflags |= __GFP_RETRY_HARD; I think this example shows that __GFP_RETRY_HARD is not a good flag because it conflates two seemingly unrelated semantics; the comment doesn't quite make up for that. When the flag is set, 1. it allows costly orders to invoke the OOM killer and retry 2. it allows !costly orders to fail While 1. is obvious from the name, 2. is not. Even if we don't want full-on fine-grained naming for every reclaim methodology and retry behavior, those two things just shouldn't be tied together. I don't see us failing !costly orders by default anytime soon, and they are common, so adding a __GFP_MAYFAIL to explicitly override that behavior seems like a good idea to me. That would make the XFS callsite here perfectly obvious. And you can still combine it with __GFP_REPEAT. For a generic allocation site like this, __GFP_MAYFAIL | __GFP_REPEAT does the right thing for all orders, and it's self-explanatory: try hard, allow falling back. Whether we want a __GFP_REPEAT or __GFP_TRY_HARD at all is a different topic.
In the long term, it might be better to provide best-effort per default and simply annotate MAYFAIL/NORETRY callsites that want to give up earlier. Because as I mentioned at LSFMM, it's much easier to identify callsites that have a convenient fallback than callsites that need to "try harder." Everybody thinks their allocations are oh so important. The former is much more specific and uses obvious criteria. Either way, __GFP_MAYFAIL should be on its own. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 2/2] xfs: map KM_MAYFAIL to __GFP_RETRY_HARD 2016-06-17 18:22 ` Johannes Weiner @ 2016-06-17 20:30 ` Vlastimil Babka 2016-06-17 21:39 ` Johannes Weiner 0 siblings, 1 reply; 22+ messages in thread From: Vlastimil Babka @ 2016-06-17 20:30 UTC (permalink / raw) To: Johannes Weiner, Michal Hocko Cc: Dave Chinner, linux-mm, Andrew Morton, Mel Gorman, Rik van Riel, LKML On 17.6.2016 20:22, Johannes Weiner wrote: > On Thu, Jun 16, 2016 at 01:26:06PM +0200, Michal Hocko wrote: >> @@ -54,6 +54,13 @@ kmem_flags_convert(xfs_km_flags_t flags) >> lflags &= ~__GFP_FS; >> } >> >> + /* >> + * Default page/slab allocator behavior is to retry for ever >> + * for small allocations. We can override this behavior by using >> + * __GFP_RETRY_HARD which will tell the allocator to retry as long >> + * as it is feasible but rather fail than retry for ever for all >> + * request sizes. >> + */ >> if (flags & KM_MAYFAIL) >> lflags |= __GFP_RETRY_HARD; > > I think this example shows that __GFP_RETRY_HARD is not a good flag > because it conflates two seemingly unrelated semantics; the comment > doesn't quite make up for that. > > When the flag is set, > > - it allows costly orders to invoke the OOM killer and retry No, it's not allowing the OOM killer for costly orders, only non-costly, AFAIK. Mainly it allows more aggressive compaction (especially after my series [1]). > - it allows !costly orders to fail > > While 1. is obvious from the name, 2. is not. Even if we don't want > full-on fine-grained naming for every reclaim methodology and retry > behavior, those two things just shouldn't be tied together. Well, if allocation is not allowed to fail, it's like trying "indefinitely hard" already. Telling it it should "try hard" then doesn't make any sense without also being able to fail. > I don't see us failing !costly order per default anytime soon, and > they are common, so adding a __GFP_MAYFAIL to explicitely override > that behavior seems like a good idea to me. 
That would make the XFS > callsite here perfectly obvious. > > And you can still combine it with __GFP_REPEAT. But that would mean the following meaningful combinations for non-costly orders (assuming e.g. GFP_KERNEL which allows reclaim/compaction in the first place). __GFP_NORETRY - that one is well understood hopefully, and implicitly mayfail __GFP_MAYFAIL - ??? __GFP_MAYFAIL | __GFP_REPEAT - ??? Which one of the last two tries harder? How specifically? Will they differ by (not) allowing OOM? Won't that be just extra confusing? > For a generic allocation site like this, __GFP_MAYFAIL | __GFP_REPEAT > does the right thing for all orders, and it's self-explanatory: try > hard, allow falling back. > > Whether we want a __GFP_REPEAT or __GFP_TRY_HARD at all is a different > topic. In the long term, it might be better to provide best-effort per > default and simply annotate MAYFAIL/NORETRY callsites that want to > give up earlier. Because as I mentioned at LSFMM, it's much easier to > identify callsites that have a convenient fallback than callsites that > need to "try harder." Everybody thinks their allocations are oh so > important. The former is much more specific and uses obvious criteria. For higher-order allocations, best-effort might also mean significant system disruption, not just latency of the allocation itself. One example is hugetlbfs allocations (echo X > .../nr_hugepages) where the admin is willing to pay this cost. But to do that by default and rely on everyone else passing NORETRY wouldn't go far. So I think the TRY_HARD kind of flag makes sense. > Either way, __GFP_MAYFAIL should be on its own. [1] https://lwn.net/Articles/689154/ ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 2/2] xfs: map KM_MAYFAIL to __GFP_RETRY_HARD 2016-06-17 20:30 ` Vlastimil Babka @ 2016-06-17 21:39 ` Johannes Weiner 2016-06-20 8:08 ` Michal Hocko 0 siblings, 1 reply; 22+ messages in thread From: Johannes Weiner @ 2016-06-17 21:39 UTC (permalink / raw) To: Vlastimil Babka Cc: Michal Hocko, Dave Chinner, linux-mm, Andrew Morton, Mel Gorman, Rik van Riel, LKML On Fri, Jun 17, 2016 at 10:30:06PM +0200, Vlastimil Babka wrote: > On 17.6.2016 20:22, Johannes Weiner wrote: > > On Thu, Jun 16, 2016 at 01:26:06PM +0200, Michal Hocko wrote: > >> @@ -54,6 +54,13 @@ kmem_flags_convert(xfs_km_flags_t flags) > >> lflags &= ~__GFP_FS; > >> } > >> > >> + /* > >> + * Default page/slab allocator behavior is to retry for ever > >> + * for small allocations. We can override this behavior by using > >> + * __GFP_RETRY_HARD which will tell the allocator to retry as long > >> + * as it is feasible but rather fail than retry for ever for all > >> + * request sizes. > >> + */ > >> if (flags & KM_MAYFAIL) > >> lflags |= __GFP_RETRY_HARD; > > > > I think this example shows that __GFP_RETRY_HARD is not a good flag > > because it conflates two seemingly unrelated semantics; the comment > > doesn't quite make up for that. > > > > When the flag is set, > > > > - it allows costly orders to invoke the OOM killer and retry > > No, it's not allowing the OOM killer for costly orders, only non-costly, AFAIK. > Mainly it allows more aggressive compaction (especially after my series [1]). Ah, you're right. It calls into the may_oom function but that skips actual killing for costly orders. > > - it allows !costly orders to fail > > > > While 1. is obvious from the name, 2. is not. Even if we don't want > > full-on fine-grained naming for every reclaim methodology and retry > > behavior, those two things just shouldn't be tied together. > > Well, if allocation is not allowed to fail, it's like trying "indefinitely hard" > already. 
Telling it it should "try hard" then doesn't make any sense without > also being able to fail. I can see that argument, but it's really anything but obvious at the callsite. Dave's response to Michal's patch was a good demonstration. And I don't think adding comments fixes an unintuitive interface. > > I don't see us failing !costly order per default anytime soon, and > > they are common, so adding a __GFP_MAYFAIL to explicitely override > > that behavior seems like a good idea to me. That would make the XFS > > callsite here perfectly obvious. > > > > And you can still combine it with __GFP_REPEAT. > > But that would mean the following meaningful combinations for non-costly orders > (assuming e.g. GFP_KERNEL which allows reclaim/compaction in the first place). I would ignore order here. Part of what makes this interface unintuitive is when we expect different flags to be passed for different orders, especially because the orders are often variable. Michal's __GFP_RETRY_HARD is an improvement in the sense that it ignores the order and tries to do the right thing regardless of it. The interface should really be about the intent at the callsite, not about implementation details of the allocator. But adding TRY_HARD to express "this can fail" isn't intuitive. > __GFP_NORETRY - that one is well understood hopefully, and implicitly mayfail Yeah. Never OOM, never retry etc. The callsite can fall back, and prefers that over OOM kills and disruptive allocation latencies. > __GFP_MAYFAIL - ??? May OOM for certain orders and retry a few times, but still fail. The callsite can fall back, but it wouldn't come for free. E.g. it might have to abort an explicitly requested user operation. This is the default for costly orders, so it has an effect only on non-costly orders. But that's where I would separate interface from implementation: you'd use it e.g. in callsites where you have variable orders but always the same fallback. XFS does that extensively.
> __GFP_MAYFAIL | __GFP_REPEAT - ??? > > Which one of the last two tries harder? How specifically? Will they differ by > (not) allowing OOM? Won't that be just extra confusing? Adding __GFP_REPEAT would always be additive. This combination would mean: try the hardest not to fail, but don't lock up in cases when the order happens to be !costly. Again, I'm not too thrilled about that flag as it's so damn vague. But that's more about how we communicate latency/success expectations. My concern is exclusively about its implication of MAYFAIL. > > For a generic allocation site like this, __GFP_MAYFAIL | __GFP_REPEAT > > does the right thing for all orders, and it's self-explanatory: try > > hard, allow falling back. > > > > Whether we want a __GFP_REPEAT or __GFP_TRY_HARD at all is a different > > topic. In the long term, it might be better to provide best-effort per > > default and simply annotate MAYFAIL/NORETRY callsites that want to > > give up earlier. Because as I mentioned at LSFMM, it's much easier to > > identify callsites that have a convenient fallback than callsites that > > need to "try harder." Everybody thinks their allocations are oh so > > important. The former is much more specific and uses obvious criteria. > > For higher-order allocations, best-effort might also mean significant system > disruption, not just latency of the allocation itself. One example is hugeltbfs > allocations (echo X > .../nr_hugepages) where the admin is willing to pay this > cost. But to do that by default and rely on everyone else passing NORETRY > wouldn't go far. So I think the TRY_HARD kind of flag makes sense. I think whether the best-effort behavior should be opt-in or opt-out, or how fine-grained the latency/success control over the allocator should be is a different topic. I'd prefer defaulting to reliability and annotating low-latency requirements, but I can see TRY_HARD work too. It just shouldn't imply MAY_FAIL. 
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 2/2] xfs: map KM_MAYFAIL to __GFP_RETRY_HARD 2016-06-17 21:39 ` Johannes Weiner @ 2016-06-20 8:08 ` Michal Hocko 2016-06-21 4:22 ` Johannes Weiner 0 siblings, 1 reply; 22+ messages in thread From: Michal Hocko @ 2016-06-20 8:08 UTC (permalink / raw) To: Johannes Weiner Cc: Vlastimil Babka, Dave Chinner, linux-mm, Andrew Morton, Mel Gorman, Rik van Riel, LKML On Fri 17-06-16 17:39:31, Johannes Weiner wrote: > On Fri, Jun 17, 2016 at 10:30:06PM +0200, Vlastimil Babka wrote: > > On 17.6.2016 20:22, Johannes Weiner wrote: [...] > > > - it allows !costly orders to fail > > > > > > While 1. is obvious from the name, 2. is not. Even if we don't want > > > full-on fine-grained naming for every reclaim methodology and retry > > > behavior, those two things just shouldn't be tied together. > > > > Well, if allocation is not allowed to fail, it's like trying "indefinitely hard" > > already. Telling it it should "try hard" then doesn't make any sense without > > also being able to fail. > > I can see that argument, but it's really anything but obvious at the > callsite. Dave's response to Michal's patch was a good demonstration. > And I don't think adding comments fixes an unintuitive interface. Yeah, I am aware of that. And it is unfortunate but a side effect of our !costly vs. costly difference in the default behavior. What I wanted to achieve was to have overrides for the default behavior (whatever it is). We already have two such flags and having something semantically in the middle sounds like a consistent way to me. > > > I don't see us failing !costly order per default anytime soon, and > > > they are common, so adding a __GFP_MAYFAIL to explicitely override > > > that behavior seems like a good idea to me. That would make the XFS > > > callsite here perfectly obvious. > > > > > > And you can still combine it with __GFP_REPEAT. > > > > But that would mean the following meaningful combinations for non-costly orders > > (assuming e.g. 
GFP_KERNEL which allows reclaim/compaction in the first place). > > I would ignore order here. Part of what makes this interface > unintuitive is when we expect different flags to be passed for > different orders, especially because the orders are often > variable. Michal's __GFP_RETRY_HARD is an improvement in the sense > that it ignores the order and tries to do the right thing regardless > of it. The interface should really be about the intent at the > callsite, not about implementation details of the allocator. > > But adding TRY_HARD to express "this can fail" isn't intuitive. I am all for a better name but everything else I could come up with was just more confusing. Take __GFP_MAYFAIL as an example. How would it be any less confusing? Aren't all the requests which do not have __GFP_NOFAIL automatically MAYFAIL? RETRY_HARD was an attempt to say that you can retry as hard as you find reasonable but eventually fail, which should fit quite nicely between NORETRY and NOFAIL. > > __GFP_NORETRY - that one is well understood hopefully, and implicitly mayfail > > Yeah. Never OOM, never retry etc. The callsite can fall back, and > prefers that over OOM kills and disruptive allocation latencies. > > > __GFP_MAYFAIL - ??? > > May OOM for certain orders and retry a few times, but still fail. The > callsite can fall back, but it wouldn't come for free. E.g. it might > have to abort an explicitely requested user operation. > > This is the default for costly orders, so it has an effect only on > non-costly orders. But that's where I would separate interface from > implementation: you'd use it e.g. in callsites where you have variable > orders but always the same fallback. XFS does that extensively. > > > __GFP_MAYFAIL | __GFP_REPEAT - ??? > > > > Which one of the last two tries harder? How specifically? Will they differ by > > (not) allowing OOM? Won't that be just extra confusing? > > Adding __GFP_REPEAT would always be additive.
This combination would > mean: try the hardest not to fail, but don't lock up in cases when the > order happens to be !costly. > > Again, I'm not too thrilled about that flag as it's so damn vague. But > that's more about how we communicate latency/success expectations. My > concern is exclusively about its implication of MAYFAIL. Our gfp flags space is quite full and adding a new flag while we keep one with a vague meaning doesn't sound very good to me. So I really think we should just ditch __GFP_REPEAT. Whether __GFP_MAYFAIL is a better name for the new flag I dunno. It feels confusing to me but if that is a general agreement I don't have a big problem with that. > > > For a generic allocation site like this, __GFP_MAYFAIL | __GFP_REPEAT > > > does the right thing for all orders, and it's self-explanatory: try > > > hard, allow falling back. > > > > > > Whether we want a __GFP_REPEAT or __GFP_TRY_HARD at all is a different > > > topic. In the long term, it might be better to provide best-effort per > > > default and simply annotate MAYFAIL/NORETRY callsites that want to > > > give up earlier. Because as I mentioned at LSFMM, it's much easier to > > > identify callsites that have a convenient fallback than callsites that > > > need to "try harder." Everybody thinks their allocations are oh so > > > important. The former is much more specific and uses obvious criteria. > > > > For higher-order allocations, best-effort might also mean significant system > > disruption, not just latency of the allocation itself. One example is hugeltbfs > > allocations (echo X > .../nr_hugepages) where the admin is willing to pay this > > cost. But to do that by default and rely on everyone else passing NORETRY > > wouldn't go far. So I think the TRY_HARD kind of flag makes sense. > > I think whether the best-effort behavior should be opt-in or opt-out, > or how fine-grained the latency/success control over the allocator > should be is a different topic.
I'd prefer defaulting to reliability and annotating low-latency requirements, but I can see TRY_HARD work too. It just shouldn't imply MAY_FAIL. It is always hard to change the default behavior without breaking anything. Up to now we had opt-in and as you can see there are not that many users who really wanted to have higher reliability. I guess this is because they just do not care and didn't see too many failures. The opt-out also has the disadvantage that we would need to provide a flag to tell the allocator to try less hard, and all we have is NORETRY and that is way too easy. So to me it sounds like the opt-in fits better with the current usage. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 2/2] xfs: map KM_MAYFAIL to __GFP_RETRY_HARD 2016-06-20 8:08 ` Michal Hocko @ 2016-06-21 4:22 ` Johannes Weiner 2016-06-21 9:29 ` Vlastimil Babka 2016-06-21 17:00 ` Michal Hocko 0 siblings, 2 replies; 22+ messages in thread From: Johannes Weiner @ 2016-06-21 4:22 UTC (permalink / raw) To: Michal Hocko Cc: Vlastimil Babka, Dave Chinner, linux-mm, Andrew Morton, Mel Gorman, Rik van Riel, LKML On Mon, Jun 20, 2016 at 10:08:56AM +0200, Michal Hocko wrote: > On Fri 17-06-16 17:39:31, Johannes Weiner wrote: > > On Fri, Jun 17, 2016 at 10:30:06PM +0200, Vlastimil Babka wrote: > > > On 17.6.2016 20:22, Johannes Weiner wrote: > [...] > > > > - it allows !costly orders to fail > > > > > > > > While 1. is obvious from the name, 2. is not. Even if we don't want > > > > full-on fine-grained naming for every reclaim methodology and retry > > > > behavior, those two things just shouldn't be tied together. > > > > > > Well, if allocation is not allowed to fail, it's like trying "indefinitely hard" > > > already. Telling it it should "try hard" then doesn't make any sense without > > > also being able to fail. > > > > I can see that argument, but it's really anything but obvious at the > > callsite. Dave's response to Michal's patch was a good demonstration. > > And I don't think adding comments fixes an unintuitive interface. > > Yeah, I am aware of that. And it is unfortunate but a side effect of our > !costly vs. costly difference in the default behavior. What I wanted > to achieve was to have overrides for the default behavior (whatever it > is). We already have two such flags and having something semantically in > the middle sounds like a consistent way to me. The "whatever it is" is the problem I'm having. It's one flag that does two entirely orthogonal things, and it's quite reasonable for somebody to want to change one behavior without the other. 
"I can handle allocation failures, even when they are !costly" has really nothing to do with "I am ready to pay a high allocation latency to make costly allocations succeed." So while I understand that you want an effort-flag leveled somewhere between NORETRY and NOFAIL, this looks more like a theoretical thing than what existing callsites actually would want to use. > > I think whether the best-effort behavior should be opt-in or opt-out, > > or how fine-grained the latency/success control over the allocator > > should be is a different topic. I'd prefer defaulting to reliability > > and annotating low-latency requirements, but I can see TRY_HARD work > > too. It just shouldn't imply MAY_FAIL. > > It is always hard to change the default behavior without breaking > anything. Up to now we had opt-in and as you can see there are not that > many users who really wanted to have higher reliability. I guess this is > because they just do not care and didn't see too many failures. The > opt-out has also a disadvantage that we would need to provide a flag > to tell to try less hard and all we have is NORETRY and that is way too > easy. So to me it sounds like the opt-in fits better with the current > usage. For costly allocations, the presence of __GFP_NORETRY is exactly the same as the absence of __GFP_REPEAT. So if we made __GFP_REPEAT the default (and deleted the flag), the opt-outs would use __GFP_NORETRY to restore their original behavior. As for changing the default - remember that we currently warn about allocation failures as if they were bugs, unless they are explicitly allocated with the __GFP_NOWARN flag. We can assume that the current __GFP_NOWARN sites are 1) commonly failing but 2) prefer to fall back rather than incurring latency (otherwise they would have added the __GFP_REPEAT flag). These sites would be a good list of candidates to annotate with __GFP_NORETRY.
If we then made __GFP_REPEAT the default, the sites that would try harder are the same sites that now emit page allocation failure warnings. These are rare, and the only times I have seen them are under enough load that latency is shot to hell anyway. So I'm not really convinced by the regression argument. But that would *actually* clean up the flags, not make them even more confusing: Allocations that can't ever handle failure would use __GFP_NOFAIL. Callers like XFS would use __GFP_MAYFAIL specifically to disable the implicit __GFP_NOFAIL of !costly allocations. Callers that would prefer falling back over killing and looping would use __GFP_NORETRY. Wouldn't that cover all use cases and be much more intuitive, both in the default behavior as well as in the names of the flags? ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 2/2] xfs: map KM_MAYFAIL to __GFP_RETRY_HARD 2016-06-21 4:22 ` Johannes Weiner @ 2016-06-21 9:29 ` Vlastimil Babka 2016-06-21 17:00 ` Michal Hocko 0 siblings, 0 replies; 22+ messages in thread From: Vlastimil Babka @ 2016-06-21 9:29 UTC (permalink / raw) To: Johannes Weiner, Michal Hocko Cc: Dave Chinner, linux-mm, Andrew Morton, Mel Gorman, Rik van Riel, LKML On 06/21/2016 06:22 AM, Johannes Weiner wrote: >>> I think whether the best-effort behavior should be opt-in or opt-out, >>> or how fine-grained the latency/success control over the allocator >>> should be is a different topic. I'd prefer defaulting to reliability >>> and annotating low-latency requirements, but I can see TRY_HARD work >>> too. It just shouldn't imply MAY_FAIL. >> >> It is always hard to change the default behavior without breaking >> anything. Up to now we had opt-in and as you can see there are not that >> many users who really wanted to have higher reliability. I guess this is >> because they just do not care and didn't see too many failures. The >> opt-out has also a disadvantage that we would need to provide a flag >> to tell to try less hard and all we have is NORETRY and that is way too >> easy. So to me it sounds like the opt-in fits better with the current >> usage. > > For costly allocations, the presence of __GFP_NORETRY is exactly the > same as the absence of __GFP_REPEAT. So if we made __GFP_REPEAT the > default (and deleted the flag), the opt-outs would use __GFP_NORETRY > to restore their original behavior. Just FYI, this argument distorts my idea of how to get rid of hacky checks for GFP_TRANSHUGE and PF_KTHREAD (patches 05 and 06 in [1]), where I observed the aforementioned lack of difference between __GFP_NORETRY presence and __GFP_REPEAT absence, and made use of it. Without __GFP_REPEAT I'd have two options for khugepaged and madvise(MADV_HUGEPAGE) allocations.
Either pass __GFP_NORETRY and make them fail more, or don't and then they become much more disruptive (if the default becomes best-effort, i.e. what __GFP_REPEAT used to do). [1] http://thread.gmane.org/gmane.linux.kernel.mm/152313 > As for changing the default - remember that we currently warn about > allocation failures as if they were bugs, unless they are explicitely > allocated with the __GFP_NOWARN flag. We can assume that the current > __GFP_NOWARN sites are 1) commonly failing but 2) prefer to fall back > rather than incurring latency (otherwise they would have added the > __GFP_REPEAT flag). These sites would be a good list of candidates to > annotate with __GFP_NORETRY. If we made __GFP_REPEAT then the default, > the sites that would then try harder are the same sites that would now > emit page allocation failure warnings. These are rare, and the only > times I have seen them is under enough load that latency is shot to > hell anyway. So I'm not really convinced by the regression argument. > > But that would *actually* clean up the flags, not make them even more > confusing: > > Allocations that can't ever handle failure would use __GFP_NOFAIL. > > Callers like XFS would use __GFP_MAYFAIL specifically to disable the > implicit __GFP_NOFAIL of !costly allocations. > > Callers that would prefer falling back over killing and looping would > use __GFP_NORETRY. > > Wouldn't that cover all usecases and be much more intuitive, both in > the default behavior as well as in the names of the flags? > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 2/2] xfs: map KM_MAYFAIL to __GFP_RETRY_HARD
  2016-06-21  4:22 ` Johannes Weiner
  2016-06-21  9:29 ` Vlastimil Babka
@ 2016-06-21 17:00 ` Michal Hocko
  1 sibling, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2016-06-21 17:00 UTC (permalink / raw)
To: Johannes Weiner
Cc: Vlastimil Babka, Dave Chinner, linux-mm, Andrew Morton, Mel Gorman, Rik van Riel, LKML

On Tue 21-06-16 00:22:49, Johannes Weiner wrote:
[...]
> As for changing the default - remember that we currently warn about
> allocation failures as if they were bugs, unless they are explicitly
> allocated with the __GFP_NOWARN flag. We can assume that the current
> __GFP_NOWARN sites are 1) commonly failing but 2) prefer to fall back
> rather than incurring latency (otherwise they would have added the
> __GFP_REPEAT flag). These sites would be a good list of candidates to
> annotate with __GFP_NORETRY.

This sounds like a good idea at first sight, but a brief git grep shows
that many of those sites are just trying to silence the warning for
non-sleeping allocations, which are quite likely to fail anyway. Those
wouldn't be hard to filter out, so we can ignore them. Then there are
things like 8be04b9374e5 ("treewide: Add __GFP_NOWARN to k.alloc calls
with v.alloc fallbacks") where the flag was added to many places with
vmalloc fallbacks. Do we want to weaken those in favor of vmalloc in
general (note that it is not clear from the size whether they are costly
or !costly)? I have looked at some random others, and they add the flag
without any explanation, so it is not clear what the motivation was. To
me it seems the flag is used quite randomly. I suspect there are many
places which do not have the flag just because nobody bothered to report
an allocation failure that is hard to reproduce.

> If we made __GFP_REPEAT then the default,
> the sites that would then try harder are the same sites that would now
> emit page allocation failure warnings. These are rare, and the only
> times I have seen them is under enough load that latency is shot to
> hell anyway. So I'm not really convinced by the regression argument.

You do not need to be under heavy load to fail those allocations. It is
sufficient for memory to be fragmented, which might be just a matter of
time. I am worried about a hard-to-examine number of allocation requests
that might change the overall system behavior: they might trigger more
reclaim/swap, and the source of the behavior change would not be at all
obvious. On the other hand, we already have some places annotated to ask
for more effort, which is the reason I would find it better to follow up
on those. High-order allocations can be really expensive, and the
current behavior with the allocation warning has the advantage that we
can see the failure mode and get a bug report with the exact trace
(hopefully) without too much background interference. Then a person
familiar with the subsystem can judge whether that particular allocation
is worth more effort or a different fallback.

I am not saying that changing the default behavior for costly
allocations is a no-go. I just feel it is too risky, and it would be
better to use an "override the default because I know what I am doing"
flag. __GFP_RETRY_HARD might be a terrible name, and a better name would
cause less confusion (__GFP_RETRY_MAYFAIL?).

> But that would *actually* clean up the flags, not make them even more
> confusing:
>
> Allocations that can't ever handle failure would use __GFP_NOFAIL.
>
> Callers like XFS would use __GFP_MAYFAIL specifically to disable the
> implicit __GFP_NOFAIL of !costly allocations.
>
> Callers that would prefer falling back over killing and looping would
> use __GFP_NORETRY.
>
> Wouldn't that cover all usecases and be much more intuitive, both in
> the default behavior as well as in the names of the flags?

How do we describe kcompactd vs. page fault THP allocations? We do not
want to cause a lot of reclaim for either, but we can wait for
compaction for the former, while we would prefer not to for the latter.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [RFC PATCH 0/2] mm: give GFP_REPEAT a better semantic
  2016-06-06 11:32 [RFC PATCH 0/2] mm: give GFP_REPEAT a better semantic Michal Hocko
  2016-06-06 11:32 ` [RFC PATCH 1/2] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_HARD with more useful semantic Michal Hocko
  2016-06-06 11:32 ` [RFC PATCH 2/2] xfs: map KM_MAYFAIL to __GFP_RETRY_HARD Michal Hocko
@ 2016-10-06 11:14 ` Michal Hocko
  2 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2016-10-06 11:14 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Mel Gorman, Vlastimil Babka, Johannes Weiner, Rik van Riel, Dave Chinner, LKML

Hi,
I would like to revive this thread. It seems we haven't reached any
conclusion. The main objection so far has been that the __GFP_RETRY_HARD
may-fail semantic is not clear and that it mixes two things together
(fail for !costly orders and repeat for costly orders). As for the first
part, I guess we can think of a better name; __GFP_RETRY_MAYFAIL should
be clearer IMHO. For the second objection, Johannes has suggested a
separate __GFP_MAYFAIL flag. In one variant we would keep __GFP_REPEAT
as a modifier to try even harder on top. My main objection was that the
gfp flag space is really scarce, but we got some relief after
http://lkml.kernel.org/r/20160912114852.GI14524@dhcp22.suse.cz
so maybe we can reconsider this. I am not convinced that a separate
MAYFAIL flag would make the situation much better semantic-wise, though:
a) it adds de facto another order-specific gfp flag, and we have seen in
the past that this concept does not work well, and b) the name doesn't
really make clear to users how hard the allocator tries if we keep
__GFP_REPEAT - e.g. what is the difference between
GFP_KERNEL|__GFP_MAYFAIL and GFP_KERNEL|__GFP_MAYFAIL|__GFP_REPEAT?

One way to differentiate the two would be to not trigger the OOM killer
in the first case, which would make it similar to __GFP_NORETRY except
that we would retry as long as there is progress and fail right before
the OOM killer would be invoked. __GFP_REPEAT on top would invoke the
OOM killer for !costly requests and keep retrying until we hit the OOM
path again. To me this sounds quite complicated, and the resulting OOM
behavior subtle and non-intuitive. An alternative would be to ignore
__GFP_REPEAT for !costly orders and keep it costly-specific.

Another suggestion by Johannes was to make __GFP_REPEAT the default for
costly orders and drop the flag; __GFP_NORETRY could then be used to
opt out and restore the previous behavior. __GFP_MAYFAIL would then be
clearer, but the main problem is that many allocation requests could
turn out to be really disruptive, leading to performance issues that
would not be easy to debug. While we currently have only a few
__GFP_REPEAT users (~30), many more allocations would become more
aggressive. So the question is how to identify those which should be
changed to __GFP_NORETRY and how to tell users when to opt out in newly
added allocation sites. To me this sounds really risky, like a
whack-a-mole game.

Does anybody see other option(s)? I fully realize the situation is not
easy and that our costly vs. !costly heritage is hard to work around.
Whatever we do can end up in a mess similar to the one we had with
__GFP_REPEAT, but I believe we should find a way to tell that a
particular request is OK to fail and override the default allocator
policy. To me, __GFP_RETRY_MAYFAIL with "retry as long as there are
reasonable chances to succeed and then fail" semantics sounds the most
coherent and attractive, because it is not order-specific from the user
point of view and there is zero risk of regressions.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread
end of thread, other threads:[~2016-10-06 11:14 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-06 11:32 [RFC PATCH 0/2] mm: give GFP_REPEAT a better semantic Michal Hocko
2016-06-06 11:32 ` [RFC PATCH 1/2] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_HARD with more useful semantic Michal Hocko
2016-06-07 12:11   ` Tetsuo Handa
2016-06-07 12:31     ` Michal Hocko
2016-06-11 14:35       ` Tetsuo Handa
2016-06-13 11:37         ` Michal Hocko
2016-06-13 14:54           ` Tetsuo Handa
2016-06-13 15:17             ` Michal Hocko
2016-06-14 11:12               ` Tetsuo Handa
2016-06-14 18:54                 ` Michal Hocko
2016-06-06 11:32 ` [RFC PATCH 2/2] xfs: map KM_MAYFAIL to __GFP_RETRY_HARD Michal Hocko
2016-06-16  0:23   ` Dave Chinner
2016-06-16  8:03     ` Michal Hocko
2016-06-16 11:26       ` Michal Hocko
2016-06-17 18:22         ` Johannes Weiner
2016-06-17 20:30           ` Vlastimil Babka
2016-06-17 21:39             ` Johannes Weiner
2016-06-20  8:08           ` Michal Hocko
2016-06-21  4:22             ` Johannes Weiner
2016-06-21  9:29               ` Vlastimil Babka
2016-06-21 17:00               ` Michal Hocko
2016-10-06 11:14 ` [RFC PATCH 0/2] mm: give GFP_REPEAT a better semantic Michal Hocko