* [PATCH 0/3] __GFP_REPEAT cleanup
@ 2015-11-05 16:15 ` mhocko
  0 siblings, 0 replies; 39+ messages in thread
From: mhocko @ 2015-11-05 16:15 UTC (permalink / raw)
  To: linux-mm; +Cc: Andrew Morton, LKML

Hi,
while working on something unrelated I've checked the current usage
of __GFP_REPEAT in the tree. It seems that a good half of it is, and
always has been, bogus: __GFP_REPEAT has always been about high-order
allocations, yet we use it for order-0 or very small orders quite
often. A big pile of them seems to be just copy&paste from code
adopted from one arch to another.

I think it makes sense to get rid of them because they only make the
semantics of the flag less clear.

The series is based on the linux-next tree. Before the series:
$ git grep __GFP_REPEAT next/master | wc -l
106

and with the series applied:
$ git grep __GFP_REPEAT | wc -l
44

There are probably more users which do not need the flag but I have focused
on the trivially superfluous ones here.
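
One rough way to spot further order-0 candidates is to filter for the
single-page allocation helpers (a heuristic only; it misses callers
whose gfp flags span multiple lines):
$ git grep -nE '__get_free_page\(|get_zeroed_page\(|alloc_page\(' | grep __GFP_REPEAT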



* [PATCH 1/3] tree wide: get rid of __GFP_REPEAT for order-0 allocations part I
  2015-11-05 16:15 ` mhocko
@ 2015-11-05 16:15   ` mhocko
  -1 siblings, 0 replies; 39+ messages in thread
From: mhocko @ 2015-11-05 16:15 UTC (permalink / raw)
  To: linux-mm; +Cc: Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

__GFP_REPEAT has rather weak semantics, and since its introduction
around 2.6.12 it has been ignored for low-order allocations. Yet the
kernel tree is full of its usage for apparently order-0 allocations.
This is really confusing because __GFP_REPEAT is explicitly documented
to allow allocation failures, which is a weaker semantic than what
order-0 allocations currently get (basically nofail).
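
For reference, the flag's description in include/linux/gfp.h has read
along these lines for years (approximate wording):

 * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
 * _might_ fail.  This depends upon the particular VM implementation.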

Let's simply rip out __GFP_REPEAT from those places. This will make it
easier to identify the places which really need the allocator to retry
harder, and to formulate more specific semantics for what the flag is
actually supposed to do.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 arch/alpha/include/asm/pgalloc.h         | 4 ++--
 arch/arm/include/asm/pgalloc.h           | 2 +-
 arch/avr32/include/asm/pgalloc.h         | 6 +++---
 arch/cris/include/asm/pgalloc.h          | 4 ++--
 arch/frv/mm/pgalloc.c                    | 6 +++---
 arch/hexagon/include/asm/pgalloc.h       | 4 ++--
 arch/m68k/include/asm/mcf_pgalloc.h      | 4 ++--
 arch/m68k/include/asm/motorola_pgalloc.h | 4 ++--
 arch/m68k/include/asm/sun3_pgalloc.h     | 4 ++--
 arch/metag/include/asm/pgalloc.h         | 5 ++---
 arch/microblaze/include/asm/pgalloc.h    | 4 ++--
 arch/microblaze/mm/pgtable.c             | 3 +--
 arch/mn10300/mm/pgtable.c                | 6 +++---
 arch/openrisc/include/asm/pgalloc.h      | 2 +-
 arch/openrisc/mm/ioremap.c               | 2 +-
 arch/parisc/include/asm/pgalloc.h        | 4 ++--
 arch/powerpc/include/asm/pgalloc-64.h    | 2 +-
 arch/powerpc/mm/pgtable_32.c             | 4 ++--
 arch/powerpc/mm/pgtable_64.c             | 3 +--
 arch/s390/mm/pgtable.c                   | 2 +-
 arch/sh/include/asm/pgalloc.h            | 4 ++--
 arch/sparc/mm/init_64.c                  | 6 ++----
 arch/um/kernel/mem.c                     | 4 ++--
 arch/x86/include/asm/pgalloc.h           | 4 ++--
 arch/x86/xen/p2m.c                       | 2 +-
 arch/xtensa/include/asm/pgalloc.h        | 2 +-
 drivers/block/aoe/aoecmd.c               | 2 +-
 27 files changed, 47 insertions(+), 52 deletions(-)

diff --git a/arch/alpha/include/asm/pgalloc.h b/arch/alpha/include/asm/pgalloc.h
index aab14a019c20..c2ebb6f36c9d 100644
--- a/arch/alpha/include/asm/pgalloc.h
+++ b/arch/alpha/include/asm/pgalloc.h
@@ -40,7 +40,7 @@ pgd_free(struct mm_struct *mm, pgd_t *pgd)
 static inline pmd_t *
 pmd_alloc_one(struct mm_struct *mm, unsigned long address)
 {
-	pmd_t *ret = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
+	pmd_t *ret = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);
 	return ret;
 }
 
@@ -53,7 +53,7 @@ pmd_free(struct mm_struct *mm, pmd_t *pmd)
 static inline pte_t *
 pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
 {
-	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
+	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);
 	return pte;
 }
 
diff --git a/arch/arm/include/asm/pgalloc.h b/arch/arm/include/asm/pgalloc.h
index 19cfab526d13..20febb368844 100644
--- a/arch/arm/include/asm/pgalloc.h
+++ b/arch/arm/include/asm/pgalloc.h
@@ -29,7 +29,7 @@
 
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-	return (pmd_t *)get_zeroed_page(GFP_KERNEL | __GFP_REPEAT);
+	return (pmd_t *)get_zeroed_page(GFP_KERNEL);
 }
 
 static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
diff --git a/arch/avr32/include/asm/pgalloc.h b/arch/avr32/include/asm/pgalloc.h
index 1aba19d68c5e..db039cb368be 100644
--- a/arch/avr32/include/asm/pgalloc.h
+++ b/arch/avr32/include/asm/pgalloc.h
@@ -43,7 +43,7 @@ static inline void pgd_ctor(void *x)
  */
 static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-	return quicklist_alloc(QUICK_PGD, GFP_KERNEL | __GFP_REPEAT, pgd_ctor);
+	return quicklist_alloc(QUICK_PGD, GFP_KERNEL, pgd_ctor);
 }
 
 static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
@@ -54,7 +54,7 @@ static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
 					  unsigned long address)
 {
-	return quicklist_alloc(QUICK_PT, GFP_KERNEL | __GFP_REPEAT, NULL);
+	return quicklist_alloc(QUICK_PT, GFP_KERNEL, NULL);
 }
 
 static inline pgtable_t pte_alloc_one(struct mm_struct *mm,
@@ -63,7 +63,7 @@ static inline pgtable_t pte_alloc_one(struct mm_struct *mm,
 	struct page *page;
 	void *pg;
 
-	pg = quicklist_alloc(QUICK_PT, GFP_KERNEL | __GFP_REPEAT, NULL);
+	pg = quicklist_alloc(QUICK_PT, GFP_KERNEL, NULL);
 	if (!pg)
 		return NULL;
 
diff --git a/arch/cris/include/asm/pgalloc.h b/arch/cris/include/asm/pgalloc.h
index 235ece437ddd..42f1affb9c2d 100644
--- a/arch/cris/include/asm/pgalloc.h
+++ b/arch/cris/include/asm/pgalloc.h
@@ -24,14 +24,14 @@ static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 
 static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
 {
-  	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
+	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);
  	return pte;
 }
 
 static inline pgtable_t pte_alloc_one(struct mm_struct *mm, unsigned long address)
 {
 	struct page *pte;
-	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
+	pte = alloc_pages(GFP_KERNEL|__GFP_ZERO, 0);
 	if (!pte)
 		return NULL;
 	if (!pgtable_page_ctor(pte)) {
diff --git a/arch/frv/mm/pgalloc.c b/arch/frv/mm/pgalloc.c
index 41907d25ed38..c9ed14f6c67d 100644
--- a/arch/frv/mm/pgalloc.c
+++ b/arch/frv/mm/pgalloc.c
@@ -22,7 +22,7 @@ pgd_t swapper_pg_dir[PTRS_PER_PGD] __attribute__((aligned(PAGE_SIZE)));
 
 pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
 {
-	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL);
 	if (pte)
 		clear_page(pte);
 	return pte;
@@ -33,9 +33,9 @@ pgtable_t pte_alloc_one(struct mm_struct *mm, unsigned long address)
 	struct page *page;
 
 #ifdef CONFIG_HIGHPTE
-	page = alloc_pages(GFP_KERNEL|__GFP_HIGHMEM|__GFP_REPEAT, 0);
+	page = alloc_pages(GFP_KERNEL|__GFP_HIGHMEM, 0);
 #else
-	page = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
+	page = alloc_pages(GFP_KERNEL, 0);
 #endif
 	if (!page)
 		return NULL;
diff --git a/arch/hexagon/include/asm/pgalloc.h b/arch/hexagon/include/asm/pgalloc.h
index 77da3b0ae3c2..eeebf862c46c 100644
--- a/arch/hexagon/include/asm/pgalloc.h
+++ b/arch/hexagon/include/asm/pgalloc.h
@@ -64,7 +64,7 @@ static inline struct page *pte_alloc_one(struct mm_struct *mm,
 {
 	struct page *pte;
 
-	pte = alloc_page(GFP_KERNEL | __GFP_REPEAT | __GFP_ZERO);
+	pte = alloc_page(GFP_KERNEL | __GFP_ZERO);
 	if (!pte)
 		return NULL;
 	if (!pgtable_page_ctor(pte)) {
@@ -78,7 +78,7 @@ static inline struct page *pte_alloc_one(struct mm_struct *mm,
 static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
 					  unsigned long address)
 {
-	gfp_t flags =  GFP_KERNEL | __GFP_REPEAT | __GFP_ZERO;
+	gfp_t flags =  GFP_KERNEL | __GFP_ZERO;
 	return (pte_t *) __get_free_page(flags);
 }
 
diff --git a/arch/m68k/include/asm/mcf_pgalloc.h b/arch/m68k/include/asm/mcf_pgalloc.h
index f9924fbcfe42..fb95aed5f428 100644
--- a/arch/m68k/include/asm/mcf_pgalloc.h
+++ b/arch/m68k/include/asm/mcf_pgalloc.h
@@ -14,7 +14,7 @@ extern const char bad_pmd_string[];
 extern inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
 	unsigned long address)
 {
-	unsigned long page = __get_free_page(GFP_DMA|__GFP_REPEAT);
+	unsigned long page = __get_free_page(GFP_DMA);
 
 	if (!page)
 		return NULL;
@@ -51,7 +51,7 @@ static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t page,
 static inline struct page *pte_alloc_one(struct mm_struct *mm,
 	unsigned long address)
 {
-	struct page *page = alloc_pages(GFP_DMA|__GFP_REPEAT, 0);
+	struct page *page = alloc_pages(GFP_DMA, 0);
 	pte_t *pte;
 
 	if (!page)
diff --git a/arch/m68k/include/asm/motorola_pgalloc.h b/arch/m68k/include/asm/motorola_pgalloc.h
index 24bcba496c75..c895b987202c 100644
--- a/arch/m68k/include/asm/motorola_pgalloc.h
+++ b/arch/m68k/include/asm/motorola_pgalloc.h
@@ -11,7 +11,7 @@ static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long ad
 {
 	pte_t *pte;
 
-	pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
+	pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);
 	if (pte) {
 		__flush_page_to_ram(pte);
 		flush_tlb_kernel_page(pte);
@@ -32,7 +32,7 @@ static inline pgtable_t pte_alloc_one(struct mm_struct *mm, unsigned long addres
 	struct page *page;
 	pte_t *pte;
 
-	page = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
+	page = alloc_pages(GFP_KERNEL|__GFP_ZERO, 0);
 	if(!page)
 		return NULL;
 	if (!pgtable_page_ctor(page)) {
diff --git a/arch/m68k/include/asm/sun3_pgalloc.h b/arch/m68k/include/asm/sun3_pgalloc.h
index 0931388de47f..1901f61f926f 100644
--- a/arch/m68k/include/asm/sun3_pgalloc.h
+++ b/arch/m68k/include/asm/sun3_pgalloc.h
@@ -37,7 +37,7 @@ do {							\
 static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
 					  unsigned long address)
 {
-	unsigned long page = __get_free_page(GFP_KERNEL|__GFP_REPEAT);
+	unsigned long page = __get_free_page(GFP_KERNEL);
 
 	if (!page)
 		return NULL;
@@ -49,7 +49,7 @@ static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
 static inline pgtable_t pte_alloc_one(struct mm_struct *mm,
 					unsigned long address)
 {
-        struct page *page = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
+        struct page *page = alloc_pages(GFP_KERNEL, 0);
 
 	if (page == NULL)
 		return NULL;
diff --git a/arch/metag/include/asm/pgalloc.h b/arch/metag/include/asm/pgalloc.h
index 3104df0a4822..c2caa1ee4360 100644
--- a/arch/metag/include/asm/pgalloc.h
+++ b/arch/metag/include/asm/pgalloc.h
@@ -42,8 +42,7 @@ static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
 					  unsigned long address)
 {
-	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL | __GFP_REPEAT |
-					      __GFP_ZERO);
+	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
 	return pte;
 }
 
@@ -51,7 +50,7 @@ static inline pgtable_t pte_alloc_one(struct mm_struct *mm,
 				      unsigned long address)
 {
 	struct page *pte;
-	pte = alloc_pages(GFP_KERNEL | __GFP_REPEAT | __GFP_ZERO, 0);
+	pte = alloc_pages(GFP_KERNEL  | __GFP_ZERO, 0);
 	if (!pte)
 		return NULL;
 	if (!pgtable_page_ctor(pte)) {
diff --git a/arch/microblaze/include/asm/pgalloc.h b/arch/microblaze/include/asm/pgalloc.h
index 61436d69775c..7c89390c0c13 100644
--- a/arch/microblaze/include/asm/pgalloc.h
+++ b/arch/microblaze/include/asm/pgalloc.h
@@ -116,9 +116,9 @@ static inline struct page *pte_alloc_one(struct mm_struct *mm,
 	struct page *ptepage;
 
 #ifdef CONFIG_HIGHPTE
-	int flags = GFP_KERNEL | __GFP_HIGHMEM | __GFP_REPEAT;
+	int flags = GFP_KERNEL | __GFP_HIGHMEM;
 #else
-	int flags = GFP_KERNEL | __GFP_REPEAT;
+	int flags = GFP_KERNEL;
 #endif
 
 	ptepage = alloc_pages(flags, 0);
diff --git a/arch/microblaze/mm/pgtable.c b/arch/microblaze/mm/pgtable.c
index 4f4520e779a5..eb99fcc76088 100644
--- a/arch/microblaze/mm/pgtable.c
+++ b/arch/microblaze/mm/pgtable.c
@@ -239,8 +239,7 @@ __init_refok pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
 {
 	pte_t *pte;
 	if (mem_init_done) {
-		pte = (pte_t *)__get_free_page(GFP_KERNEL |
-					__GFP_REPEAT | __GFP_ZERO);
+		pte = (pte_t *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
 	} else {
 		pte = (pte_t *)early_get_page();
 		if (pte)
diff --git a/arch/mn10300/mm/pgtable.c b/arch/mn10300/mm/pgtable.c
index e77a7c728081..9577cf768875 100644
--- a/arch/mn10300/mm/pgtable.c
+++ b/arch/mn10300/mm/pgtable.c
@@ -63,7 +63,7 @@ void set_pmd_pfn(unsigned long vaddr, unsigned long pfn, pgprot_t flags)
 
 pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
 {
-	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL);
 	if (pte)
 		clear_page(pte);
 	return pte;
@@ -74,9 +74,9 @@ struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address)
 	struct page *pte;
 
 #ifdef CONFIG_HIGHPTE
-	pte = alloc_pages(GFP_KERNEL|__GFP_HIGHMEM|__GFP_REPEAT, 0);
+	pte = alloc_pages(GFP_KERNEL|__GFP_HIGHMEM, 0);
 #else
-	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
+	pte = alloc_pages(GFP_KERNEL, 0);
 #endif
 	if (!pte)
 		return NULL;
diff --git a/arch/openrisc/include/asm/pgalloc.h b/arch/openrisc/include/asm/pgalloc.h
index 21484e5b9e9a..87eebd185089 100644
--- a/arch/openrisc/include/asm/pgalloc.h
+++ b/arch/openrisc/include/asm/pgalloc.h
@@ -77,7 +77,7 @@ static inline struct page *pte_alloc_one(struct mm_struct *mm,
 					 unsigned long address)
 {
 	struct page *pte;
-	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
+	pte = alloc_pages(GFP_KERNEL, 0);
 	if (!pte)
 		return NULL;
 	clear_page(page_address(pte));
diff --git a/arch/openrisc/mm/ioremap.c b/arch/openrisc/mm/ioremap.c
index 62b08ef392be..5b2a95116e8f 100644
--- a/arch/openrisc/mm/ioremap.c
+++ b/arch/openrisc/mm/ioremap.c
@@ -122,7 +122,7 @@ pte_t __init_refok *pte_alloc_one_kernel(struct mm_struct *mm,
 	pte_t *pte;
 
 	if (likely(mem_init_done)) {
-		pte = (pte_t *) __get_free_page(GFP_KERNEL | __GFP_REPEAT);
+		pte = (pte_t *) __get_free_page(GFP_KERNEL);
 	} else {
 		pte = (pte_t *) alloc_bootmem_low_pages(PAGE_SIZE);
 #if 0
diff --git a/arch/parisc/include/asm/pgalloc.h b/arch/parisc/include/asm/pgalloc.h
index 3edbb9fc91b4..b7e4027aac4b 100644
--- a/arch/parisc/include/asm/pgalloc.h
+++ b/arch/parisc/include/asm/pgalloc.h
@@ -124,7 +124,7 @@ pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmd, pte_t *pte)
 static inline pgtable_t
 pte_alloc_one(struct mm_struct *mm, unsigned long address)
 {
-	struct page *page = alloc_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
+	struct page *page = alloc_page(GFP_KERNEL|__GFP_ZERO);
 	if (!page)
 		return NULL;
 	if (!pgtable_page_ctor(page)) {
@@ -137,7 +137,7 @@ pte_alloc_one(struct mm_struct *mm, unsigned long address)
 static inline pte_t *
 pte_alloc_one_kernel(struct mm_struct *mm, unsigned long addr)
 {
-	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
+	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);
 	return pte;
 }
 
diff --git a/arch/powerpc/include/asm/pgalloc-64.h b/arch/powerpc/include/asm/pgalloc-64.h
index 4b0be20fcbfd..4e19f734447b 100644
--- a/arch/powerpc/include/asm/pgalloc-64.h
+++ b/arch/powerpc/include/asm/pgalloc-64.h
@@ -79,7 +79,7 @@ static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
 static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
 					  unsigned long address)
 {
-	return (pte_t *)__get_free_page(GFP_KERNEL | __GFP_REPEAT | __GFP_ZERO);
+	return (pte_t *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
 }
 
 static inline pgtable_t pte_alloc_one(struct mm_struct *mm,
diff --git a/arch/powerpc/mm/pgtable_32.c b/arch/powerpc/mm/pgtable_32.c
index 7692d1bb1bc6..494b1838485c 100644
--- a/arch/powerpc/mm/pgtable_32.c
+++ b/arch/powerpc/mm/pgtable_32.c
@@ -109,7 +109,7 @@ __init_refok pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long add
 	pte_t *pte;
 
 	if (slab_is_available()) {
-		pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
+		pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);
 	} else {
 		pte = __va(memblock_alloc(PAGE_SIZE, PAGE_SIZE));
 		if (pte)
@@ -122,7 +122,7 @@ pgtable_t pte_alloc_one(struct mm_struct *mm, unsigned long address)
 {
 	struct page *ptepage;
 
-	gfp_t flags = GFP_KERNEL | __GFP_REPEAT | __GFP_ZERO;
+	gfp_t flags = GFP_KERNEL | __GFP_ZERO;
 
 	ptepage = alloc_pages(flags, 0);
 	if (!ptepage)
diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index 422c59a24561..4e225504ed19 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -386,8 +386,7 @@ static pte_t *get_from_cache(struct mm_struct *mm)
 static pte_t *__alloc_for_cache(struct mm_struct *mm, int kernel)
 {
 	void *ret = NULL;
-	struct page *page = alloc_page(GFP_KERNEL | __GFP_NOTRACK |
-				       __GFP_REPEAT | __GFP_ZERO);
+	struct page *page = alloc_page(GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO);
 	if (!page)
 		return NULL;
 	if (!kernel && !pgtable_page_ctor(page)) {
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 34f3790fe459..76c98c344c60 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -966,7 +966,7 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
 			return table;
 	}
 	/* Allocate a fresh page */
-	page = alloc_page(GFP_KERNEL|__GFP_REPEAT);
+	page = alloc_page(GFP_KERNEL);
 	if (!page)
 		return NULL;
 	if (!pgtable_page_ctor(page)) {
diff --git a/arch/sh/include/asm/pgalloc.h b/arch/sh/include/asm/pgalloc.h
index a33673b3687d..f3f42c84c40f 100644
--- a/arch/sh/include/asm/pgalloc.h
+++ b/arch/sh/include/asm/pgalloc.h
@@ -34,7 +34,7 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd,
 static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
 					  unsigned long address)
 {
-	return quicklist_alloc(QUICK_PT, GFP_KERNEL | __GFP_REPEAT, NULL);
+	return quicklist_alloc(QUICK_PT, GFP_KERNEL, NULL);
 }
 
 static inline pgtable_t pte_alloc_one(struct mm_struct *mm,
@@ -43,7 +43,7 @@ static inline pgtable_t pte_alloc_one(struct mm_struct *mm,
 	struct page *page;
 	void *pg;
 
-	pg = quicklist_alloc(QUICK_PT, GFP_KERNEL | __GFP_REPEAT, NULL);
+	pg = quicklist_alloc(QUICK_PT, GFP_KERNEL, NULL);
 	if (!pg)
 		return NULL;
 	page = virt_to_page(pg);
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 3025bd57f7ab..1dee80aaed0f 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2712,8 +2712,7 @@ void __flush_tlb_all(void)
 pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
 			    unsigned long address)
 {
-	struct page *page = alloc_page(GFP_KERNEL | __GFP_NOTRACK |
-				       __GFP_REPEAT | __GFP_ZERO);
+	struct page *page = alloc_page(GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO);
 	pte_t *pte = NULL;
 
 	if (page)
@@ -2725,8 +2724,7 @@ pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
 pgtable_t pte_alloc_one(struct mm_struct *mm,
 			unsigned long address)
 {
-	struct page *page = alloc_page(GFP_KERNEL | __GFP_NOTRACK |
-				       __GFP_REPEAT | __GFP_ZERO);
+	struct page *page = alloc_page(GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO);
 	if (!page)
 		return NULL;
 	if (!pgtable_page_ctor(page)) {
diff --git a/arch/um/kernel/mem.c b/arch/um/kernel/mem.c
index b2a2dff50b4e..e7437ec62710 100644
--- a/arch/um/kernel/mem.c
+++ b/arch/um/kernel/mem.c
@@ -204,7 +204,7 @@ pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
 {
 	pte_t *pte;
 
-	pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
+	pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);
 	return pte;
 }
 
@@ -212,7 +212,7 @@ pgtable_t pte_alloc_one(struct mm_struct *mm, unsigned long address)
 {
 	struct page *pte;
 
-	pte = alloc_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
+	pte = alloc_page(GFP_KERNEL|__GFP_ZERO);
 	if (!pte)
 		return NULL;
 	if (!pgtable_page_ctor(pte)) {
diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h
index bf7f8b55b0f9..574c23cf761a 100644
--- a/arch/x86/include/asm/pgalloc.h
+++ b/arch/x86/include/asm/pgalloc.h
@@ -81,7 +81,7 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd,
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
 	struct page *page;
-	page = alloc_pages(GFP_KERNEL | __GFP_REPEAT | __GFP_ZERO, 0);
+	page = alloc_pages(GFP_KERNEL |  __GFP_ZERO, 0);
 	if (!page)
 		return NULL;
 	if (!pgtable_pmd_page_ctor(page)) {
@@ -125,7 +125,7 @@ static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgd, pud_t *pud)
 
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-	return (pud_t *)get_zeroed_page(GFP_KERNEL|__GFP_REPEAT);
+	return (pud_t *)get_zeroed_page(GFP_KERNEL);
 }
 
 static inline void pud_free(struct mm_struct *mm, pud_t *pud)
diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c
index cab9f766bb06..dd2a49a8aacc 100644
--- a/arch/x86/xen/p2m.c
+++ b/arch/x86/xen/p2m.c
@@ -182,7 +182,7 @@ static void * __ref alloc_p2m_page(void)
 	if (unlikely(!slab_is_available()))
 		return alloc_bootmem_align(PAGE_SIZE, PAGE_SIZE);
 
-	return (void *)__get_free_page(GFP_KERNEL | __GFP_REPEAT);
+	return (void *)__get_free_page(GFP_KERNEL);
 }
 
 static void __ref free_p2m_page(void *p)
diff --git a/arch/xtensa/include/asm/pgalloc.h b/arch/xtensa/include/asm/pgalloc.h
index d38eb9237e64..1065bc8bcae5 100644
--- a/arch/xtensa/include/asm/pgalloc.h
+++ b/arch/xtensa/include/asm/pgalloc.h
@@ -44,7 +44,7 @@ static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
 	pte_t *ptep;
 	int i;
 
-	ptep = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+	ptep = (pte_t *)__get_free_page(GFP_KERNEL);
 	if (!ptep)
 		return NULL;
 	for (i = 0; i < 1024; i++)
diff --git a/drivers/block/aoe/aoecmd.c b/drivers/block/aoe/aoecmd.c
index ad80c85e0857..5ac30cbba002 100644
--- a/drivers/block/aoe/aoecmd.c
+++ b/drivers/block/aoe/aoecmd.c
@@ -1750,7 +1750,7 @@ aoecmd_init(void)
 	int ret;
 
 	/* get_zeroed_page returns page with ref count 1 */
-	p = (void *) get_zeroed_page(GFP_KERNEL | __GFP_REPEAT);
+	p = (void *) get_zeroed_page(GFP_KERNEL);
 	if (!p)
 		return -ENOMEM;
 	empty_page = virt_to_page(p);
-- 
2.6.1



* [PATCH 2/3] tree wide: get rid of __GFP_REPEAT for small order requests
  2015-11-05 16:15 ` mhocko
@ 2015-11-05 16:15   ` mhocko
  -1 siblings, 0 replies; 39+ messages in thread
From: mhocko @ 2015-11-05 16:15 UTC (permalink / raw)
  To: linux-mm; +Cc: Andrew Morton, LKML, Michal Hocko, linux-arch

From: Michal Hocko <mhocko@suse.com>

__GFP_REPEAT has rather weak semantics, and since its introduction
around 2.6.12 it has been ignored for low-order allocations. Yet we
have users which pass this flag even though they end up doing order-0
or small-order allocations:

* arc: pte_alloc_one_kernel uses __get_order_pte but this is obviously
  always zero because BITS_FOR_PTE is not larger than 9 while the page
  size is always at least 4K (see the worked example after this list)
* arm: hides __GFP_REPEAT behind PGALLOC_GFP but the actual allocations
  are always order-0. __pgd_alloc does an order-2 allocation, but that
  is still not a costly one
* arm64: does basically the same except that it can have
  PGD_SIZE != PAGE_SIZE, so let's keep __GFP_REPEAT explicit there
* mips, nios2, parisc, score, x86: hide the flag behind PTE_ORDER and
  PMD_ORDER, but neither is larger than 1 and this seems like
  copy&paste between arches.
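
The arc case is easy to verify with a back-of-the-envelope computation
(a standalone sketch; the constants are the assumed worst case of
BITS_FOR_PTE == 9, 4-byte PTEs and the minimum 4K page size, not taken
from any particular arc config):

#include <stdio.h>

int main(void)
{
	unsigned int table_bytes = (1u << 9) * 4;	/* 2KB of PTEs */
	unsigned int page_bytes = 4096;
	unsigned int order = 0;

	/* open-coded get_order(): smallest order that fits the table */
	while ((page_bytes << order) < table_bytes)
		order++;

	printf("order %u\n", order);	/* prints 0: __get_order_pte() == 0 */
	return 0;
}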

This is really confusing because __GFP_REPEAT is explicitly documented
to allow allocation failures, which is a weaker semantic than what
order-0 allocations currently get (basically nofail).
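
To show why low orders never see the documented failure mode, the
retry decision in mm/page_alloc.c has long looked roughly like this
(a simplified sketch of the logic, not the verbatim code):

static inline int
should_alloc_retry(gfp_t gfp_mask, unsigned int order,
		   unsigned long pages_reclaimed)
{
	if (gfp_mask & __GFP_NORETRY)
		return 0;
	if (gfp_mask & __GFP_NOFAIL)
		return 1;

	/*
	 * Orders up to PAGE_ALLOC_COSTLY_ORDER (3) are retried
	 * unconditionally, i.e. they are implicitly nofail whether
	 * or not __GFP_REPEAT was passed.
	 */
	if (order <= PAGE_ALLOC_COSTLY_ORDER)
		return 1;

	/*
	 * Only for costly orders does __GFP_REPEAT change anything:
	 * keep retrying as long as reclaim keeps making progress.
	 */
	if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order))
		return 1;

	return 0;
}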

Let's simply rip out __GFP_REPEAT from those places. This will make it
easier to identify the places which really need the allocator to retry
harder, and to formulate more specific semantics for what the flag is
actually supposed to do.

Cc: linux-arch@vger.kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 arch/arc/include/asm/pgalloc.h    | 4 ++--
 arch/arm/include/asm/pgalloc.h    | 2 +-
 arch/arm/mm/pgd.c                 | 2 +-
 arch/arm64/include/asm/pgalloc.h  | 2 +-
 arch/arm64/mm/pgd.c               | 2 +-
 arch/mips/include/asm/pgalloc.h   | 6 +++---
 arch/nios2/include/asm/pgalloc.h  | 5 ++---
 arch/parisc/include/asm/pgalloc.h | 3 +--
 arch/score/include/asm/pgalloc.h  | 5 ++---
 arch/x86/kernel/espfix_64.c       | 2 +-
 arch/x86/mm/pgtable.c             | 2 +-
 11 files changed, 16 insertions(+), 19 deletions(-)

diff --git a/arch/arc/include/asm/pgalloc.h b/arch/arc/include/asm/pgalloc.h
index 86ed671286df..3749234b7419 100644
--- a/arch/arc/include/asm/pgalloc.h
+++ b/arch/arc/include/asm/pgalloc.h
@@ -95,7 +95,7 @@ static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
 {
 	pte_t *pte;
 
-	pte = (pte_t *) __get_free_pages(GFP_KERNEL | __GFP_REPEAT | __GFP_ZERO,
+	pte = (pte_t *) __get_free_pages(GFP_KERNEL | __GFP_ZERO,
 					 __get_order_pte());
 
 	return pte;
@@ -107,7 +107,7 @@ pte_alloc_one(struct mm_struct *mm, unsigned long address)
 	pgtable_t pte_pg;
 	struct page *page;
 
-	pte_pg = (pgtable_t)__get_free_pages(GFP_KERNEL | __GFP_REPEAT, __get_order_pte());
+	pte_pg = (pgtable_t)__get_free_pages(GFP_KERNEL, __get_order_pte());
 	if (!pte_pg)
 		return 0;
 	memzero((void *)pte_pg, PTRS_PER_PTE * sizeof(pte_t));
diff --git a/arch/arm/include/asm/pgalloc.h b/arch/arm/include/asm/pgalloc.h
index 20febb368844..b2902a5cd780 100644
--- a/arch/arm/include/asm/pgalloc.h
+++ b/arch/arm/include/asm/pgalloc.h
@@ -57,7 +57,7 @@ static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
 extern pgd_t *pgd_alloc(struct mm_struct *mm);
 extern void pgd_free(struct mm_struct *mm, pgd_t *pgd);
 
-#define PGALLOC_GFP	(GFP_KERNEL | __GFP_NOTRACK | __GFP_REPEAT | __GFP_ZERO)
+#define PGALLOC_GFP	(GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO)
 
 static inline void clean_pte_table(pte_t *pte)
 {
diff --git a/arch/arm/mm/pgd.c b/arch/arm/mm/pgd.c
index e683db1b90a3..107a2792d0a2 100644
--- a/arch/arm/mm/pgd.c
+++ b/arch/arm/mm/pgd.c
@@ -23,7 +23,7 @@
 #define __pgd_alloc()	kmalloc(PTRS_PER_PGD * sizeof(pgd_t), GFP_KERNEL)
 #define __pgd_free(pgd)	kfree(pgd)
 #else
-#define __pgd_alloc()	(pgd_t *)__get_free_pages(GFP_KERNEL | __GFP_REPEAT, 2)
+#define __pgd_alloc()	(pgd_t *)__get_free_pages(GFP_KERNEL, 2)
 #define __pgd_free(pgd)	free_pages((unsigned long)pgd, 2)
 #endif
 
diff --git a/arch/arm64/include/asm/pgalloc.h b/arch/arm64/include/asm/pgalloc.h
index c15053902942..ea515f74ac02 100644
--- a/arch/arm64/include/asm/pgalloc.h
+++ b/arch/arm64/include/asm/pgalloc.h
@@ -26,7 +26,7 @@
 
 #define check_pgt_cache()		do { } while (0)
 
-#define PGALLOC_GFP	(GFP_KERNEL | __GFP_NOTRACK | __GFP_REPEAT | __GFP_ZERO)
+#define PGALLOC_GFP	(GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO)
 #define PGD_SIZE	(PTRS_PER_PGD * sizeof(pgd_t))
 
 #if CONFIG_PGTABLE_LEVELS > 2
diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c
index cb3ba1b812e7..77902d234498 100644
--- a/arch/arm64/mm/pgd.c
+++ b/arch/arm64/mm/pgd.c
@@ -35,7 +35,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
 	if (PGD_SIZE == PAGE_SIZE)
 		return (pgd_t *)__get_free_page(PGALLOC_GFP);
 	else
-		return kmem_cache_alloc(pgd_cache, PGALLOC_GFP);
+		return kmem_cache_alloc(pgd_cache, PGALLOC_GFP | __GFP_REPEAT);
 }
 
 void pgd_free(struct mm_struct *mm, pgd_t *pgd)
diff --git a/arch/mips/include/asm/pgalloc.h b/arch/mips/include/asm/pgalloc.h
index b336037e8768..93c079a1cfc8 100644
--- a/arch/mips/include/asm/pgalloc.h
+++ b/arch/mips/include/asm/pgalloc.h
@@ -69,7 +69,7 @@ static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
 {
 	pte_t *pte;
 
-	pte = (pte_t *) __get_free_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, PTE_ORDER);
+	pte = (pte_t *) __get_free_pages(GFP_KERNEL|__GFP_ZERO, PTE_ORDER);
 
 	return pte;
 }
@@ -79,7 +79,7 @@ static inline struct page *pte_alloc_one(struct mm_struct *mm,
 {
 	struct page *pte;
 
-	pte = alloc_pages(GFP_KERNEL | __GFP_REPEAT, PTE_ORDER);
+	pte = alloc_pages(GFP_KERNEL, PTE_ORDER);
 	if (!pte)
 		return NULL;
 	clear_highpage(pte);
@@ -113,7 +113,7 @@ static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
 {
 	pmd_t *pmd;
 
-	pmd = (pmd_t *) __get_free_pages(GFP_KERNEL|__GFP_REPEAT, PMD_ORDER);
+	pmd = (pmd_t *) __get_free_pages(GFP_KERNEL, PMD_ORDER);
 	if (pmd)
 		pmd_init((unsigned long)pmd, (unsigned long)invalid_pte_table);
 	return pmd;
diff --git a/arch/nios2/include/asm/pgalloc.h b/arch/nios2/include/asm/pgalloc.h
index 6e2985e0a7b9..bb47d08c8ef7 100644
--- a/arch/nios2/include/asm/pgalloc.h
+++ b/arch/nios2/include/asm/pgalloc.h
@@ -42,8 +42,7 @@ static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
 {
 	pte_t *pte;
 
-	pte = (pte_t *) __get_free_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO,
-					PTE_ORDER);
+	pte = (pte_t *) __get_free_pages(GFP_KERNEL|__GFP_ZERO, PTE_ORDER);
 
 	return pte;
 }
@@ -53,7 +52,7 @@ static inline pgtable_t pte_alloc_one(struct mm_struct *mm,
 {
 	struct page *pte;
 
-	pte = alloc_pages(GFP_KERNEL | __GFP_REPEAT, PTE_ORDER);
+	pte = alloc_pages(GFP_KERNEL, PTE_ORDER);
 	if (pte) {
 		if (!pgtable_page_ctor(pte)) {
 			__free_page(pte);
diff --git a/arch/parisc/include/asm/pgalloc.h b/arch/parisc/include/asm/pgalloc.h
index b7e4027aac4b..bd388cfd8141 100644
--- a/arch/parisc/include/asm/pgalloc.h
+++ b/arch/parisc/include/asm/pgalloc.h
@@ -63,8 +63,7 @@ static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgd, pmd_t *pmd)
 
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
 {
-	pmd_t *pmd = (pmd_t *)__get_free_pages(GFP_KERNEL|__GFP_REPEAT,
-					       PMD_ORDER);
+	pmd_t *pmd = (pmd_t *)__get_free_pages(GFP_KERNEL, PMD_ORDER);
 	if (pmd)
 		memset(pmd, 0, PAGE_SIZE<<PMD_ORDER);
 	return pmd;
diff --git a/arch/score/include/asm/pgalloc.h b/arch/score/include/asm/pgalloc.h
index 2e067657db98..49b012d78c1a 100644
--- a/arch/score/include/asm/pgalloc.h
+++ b/arch/score/include/asm/pgalloc.h
@@ -42,8 +42,7 @@ static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
 {
 	pte_t *pte;
 
-	pte = (pte_t *) __get_free_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO,
-					PTE_ORDER);
+	pte = (pte_t *) __get_free_pages(GFP_KERNEL|__GFP_ZERO, PTE_ORDER);
 
 	return pte;
 }
@@ -53,7 +52,7 @@ static inline struct page *pte_alloc_one(struct mm_struct *mm,
 {
 	struct page *pte;
 
-	pte = alloc_pages(GFP_KERNEL | __GFP_REPEAT, PTE_ORDER);
+	pte = alloc_pages(GFP_KERNEL, PTE_ORDER);
 	if (!pte)
 		return NULL;
 	clear_highpage(pte);
diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index 4d38416e2a7f..04f89caef9c4 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -57,7 +57,7 @@
 # error "Need more than one PGD for the ESPFIX hack"
 #endif
 
-#define PGALLOC_GFP (GFP_KERNEL | __GFP_NOTRACK | __GFP_REPEAT | __GFP_ZERO)
+#define PGALLOC_GFP (GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO)
 
 /* This contains the *bottom* address of the espfix stack */
 DEFINE_PER_CPU_READ_MOSTLY(unsigned long, espfix_stack);
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index f52caf9c519b..bdbb3213f670 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -6,7 +6,7 @@
 #include <asm/fixmap.h>
 #include <asm/mtrr.h>
 
-#define PGALLOC_GFP GFP_KERNEL | __GFP_NOTRACK | __GFP_REPEAT | __GFP_ZERO
+#define PGALLOC_GFP GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO
 
 #ifdef CONFIG_HIGHPTE
 #define PGALLOC_USER_GFP __GFP_HIGHMEM
-- 
2.6.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 2/3] tree wide: get rid of __GFP_REPEAT for small order requests
@ 2015-11-05 16:15   ` mhocko
  0 siblings, 0 replies; 39+ messages in thread
From: mhocko @ 2015-11-05 16:15 UTC (permalink / raw)
  To: linux-mm; +Cc: Andrew Morton, LKML, Michal Hocko, linux-arch

From: Michal Hocko <mhocko@suse.com>

__GFP_REPEAT has a rather weak semantic but since it has been introduced
around 2.6.12 it has been ignored for low order allocations. Yet we have
users which require this flag even though they are doing order-0 or
small order allocation in the end:

* arc: pte_alloc_one_kernel uses __get_order_pte but this is obviously
  always zero because BITS_FOR_PTE is not larger than 9 yet the page
  size is always larger than 4K
* arm: hides __GFP_REPEAT behind PGALLOC_GFP but the actual allocations
  are always order-0. __pgd_alloc is doing order-2 but this is still
  not costly allocation
* arm64: does basically the same except it can have PGD_SIZE != PAGE_SIZE
  so let's keep __GFP_REPEAT explicit there
* mips, nios2, parisc, score, x86: hide behind PTE_ORDER and PMD_ORDER but
  both are not larger than 1 and this seems like a copy&paste between
  arches.

This is really confusing because __GFP_REPEAT is explicitly documented
to allow allocation failures which is a weaker semantic than the current
order-0 has (basically nofail).

Let's simply rip __GFP_REPEAT out of those places. This would allow us
to identify the places which really need the allocator to retry harder
and to formulate a more specific semantic for what the flag is actually
supposed to do.

Cc: linux-arch@vger.kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 arch/arc/include/asm/pgalloc.h    | 4 ++--
 arch/arm/include/asm/pgalloc.h    | 2 +-
 arch/arm/mm/pgd.c                 | 2 +-
 arch/arm64/include/asm/pgalloc.h  | 2 +-
 arch/arm64/mm/pgd.c               | 2 +-
 arch/mips/include/asm/pgalloc.h   | 6 +++---
 arch/nios2/include/asm/pgalloc.h  | 5 ++---
 arch/parisc/include/asm/pgalloc.h | 3 +--
 arch/score/include/asm/pgalloc.h  | 5 ++---
 arch/x86/kernel/espfix_64.c       | 2 +-
 arch/x86/mm/pgtable.c             | 2 +-
 11 files changed, 16 insertions(+), 19 deletions(-)

diff --git a/arch/arc/include/asm/pgalloc.h b/arch/arc/include/asm/pgalloc.h
index 86ed671286df..3749234b7419 100644
--- a/arch/arc/include/asm/pgalloc.h
+++ b/arch/arc/include/asm/pgalloc.h
@@ -95,7 +95,7 @@ static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
 {
 	pte_t *pte;
 
-	pte = (pte_t *) __get_free_pages(GFP_KERNEL | __GFP_REPEAT | __GFP_ZERO,
+	pte = (pte_t *) __get_free_pages(GFP_KERNEL | __GFP_ZERO,
 					 __get_order_pte());
 
 	return pte;
@@ -107,7 +107,7 @@ pte_alloc_one(struct mm_struct *mm, unsigned long address)
 	pgtable_t pte_pg;
 	struct page *page;
 
-	pte_pg = (pgtable_t)__get_free_pages(GFP_KERNEL | __GFP_REPEAT, __get_order_pte());
+	pte_pg = (pgtable_t)__get_free_pages(GFP_KERNEL, __get_order_pte());
 	if (!pte_pg)
 		return 0;
 	memzero((void *)pte_pg, PTRS_PER_PTE * sizeof(pte_t));
diff --git a/arch/arm/include/asm/pgalloc.h b/arch/arm/include/asm/pgalloc.h
index 20febb368844..b2902a5cd780 100644
--- a/arch/arm/include/asm/pgalloc.h
+++ b/arch/arm/include/asm/pgalloc.h
@@ -57,7 +57,7 @@ static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
 extern pgd_t *pgd_alloc(struct mm_struct *mm);
 extern void pgd_free(struct mm_struct *mm, pgd_t *pgd);
 
-#define PGALLOC_GFP	(GFP_KERNEL | __GFP_NOTRACK | __GFP_REPEAT | __GFP_ZERO)
+#define PGALLOC_GFP	(GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO)
 
 static inline void clean_pte_table(pte_t *pte)
 {
diff --git a/arch/arm/mm/pgd.c b/arch/arm/mm/pgd.c
index e683db1b90a3..107a2792d0a2 100644
--- a/arch/arm/mm/pgd.c
+++ b/arch/arm/mm/pgd.c
@@ -23,7 +23,7 @@
 #define __pgd_alloc()	kmalloc(PTRS_PER_PGD * sizeof(pgd_t), GFP_KERNEL)
 #define __pgd_free(pgd)	kfree(pgd)
 #else
-#define __pgd_alloc()	(pgd_t *)__get_free_pages(GFP_KERNEL | __GFP_REPEAT, 2)
+#define __pgd_alloc()	(pgd_t *)__get_free_pages(GFP_KERNEL, 2)
 #define __pgd_free(pgd)	free_pages((unsigned long)pgd, 2)
 #endif
 
diff --git a/arch/arm64/include/asm/pgalloc.h b/arch/arm64/include/asm/pgalloc.h
index c15053902942..ea515f74ac02 100644
--- a/arch/arm64/include/asm/pgalloc.h
+++ b/arch/arm64/include/asm/pgalloc.h
@@ -26,7 +26,7 @@
 
 #define check_pgt_cache()		do { } while (0)
 
-#define PGALLOC_GFP	(GFP_KERNEL | __GFP_NOTRACK | __GFP_REPEAT | __GFP_ZERO)
+#define PGALLOC_GFP	(GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO)
 #define PGD_SIZE	(PTRS_PER_PGD * sizeof(pgd_t))
 
 #if CONFIG_PGTABLE_LEVELS > 2
diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c
index cb3ba1b812e7..77902d234498 100644
--- a/arch/arm64/mm/pgd.c
+++ b/arch/arm64/mm/pgd.c
@@ -35,7 +35,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
 	if (PGD_SIZE == PAGE_SIZE)
 		return (pgd_t *)__get_free_page(PGALLOC_GFP);
 	else
-		return kmem_cache_alloc(pgd_cache, PGALLOC_GFP);
+		return kmem_cache_alloc(pgd_cache, PGALLOC_GFP | __GFP_REPEAT);
 }
 
 void pgd_free(struct mm_struct *mm, pgd_t *pgd)
diff --git a/arch/mips/include/asm/pgalloc.h b/arch/mips/include/asm/pgalloc.h
index b336037e8768..93c079a1cfc8 100644
--- a/arch/mips/include/asm/pgalloc.h
+++ b/arch/mips/include/asm/pgalloc.h
@@ -69,7 +69,7 @@ static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
 {
 	pte_t *pte;
 
-	pte = (pte_t *) __get_free_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, PTE_ORDER);
+	pte = (pte_t *) __get_free_pages(GFP_KERNEL|__GFP_ZERO, PTE_ORDER);
 
 	return pte;
 }
@@ -79,7 +79,7 @@ static inline struct page *pte_alloc_one(struct mm_struct *mm,
 {
 	struct page *pte;
 
-	pte = alloc_pages(GFP_KERNEL | __GFP_REPEAT, PTE_ORDER);
+	pte = alloc_pages(GFP_KERNEL, PTE_ORDER);
 	if (!pte)
 		return NULL;
 	clear_highpage(pte);
@@ -113,7 +113,7 @@ static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
 {
 	pmd_t *pmd;
 
-	pmd = (pmd_t *) __get_free_pages(GFP_KERNEL|__GFP_REPEAT, PMD_ORDER);
+	pmd = (pmd_t *) __get_free_pages(GFP_KERNEL, PMD_ORDER);
 	if (pmd)
 		pmd_init((unsigned long)pmd, (unsigned long)invalid_pte_table);
 	return pmd;
diff --git a/arch/nios2/include/asm/pgalloc.h b/arch/nios2/include/asm/pgalloc.h
index 6e2985e0a7b9..bb47d08c8ef7 100644
--- a/arch/nios2/include/asm/pgalloc.h
+++ b/arch/nios2/include/asm/pgalloc.h
@@ -42,8 +42,7 @@ static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
 {
 	pte_t *pte;
 
-	pte = (pte_t *) __get_free_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO,
-					PTE_ORDER);
+	pte = (pte_t *) __get_free_pages(GFP_KERNEL|__GFP_ZERO, PTE_ORDER);
 
 	return pte;
 }
@@ -53,7 +52,7 @@ static inline pgtable_t pte_alloc_one(struct mm_struct *mm,
 {
 	struct page *pte;
 
-	pte = alloc_pages(GFP_KERNEL | __GFP_REPEAT, PTE_ORDER);
+	pte = alloc_pages(GFP_KERNEL, PTE_ORDER);
 	if (pte) {
 		if (!pgtable_page_ctor(pte)) {
 			__free_page(pte);
diff --git a/arch/parisc/include/asm/pgalloc.h b/arch/parisc/include/asm/pgalloc.h
index b7e4027aac4b..bd388cfd8141 100644
--- a/arch/parisc/include/asm/pgalloc.h
+++ b/arch/parisc/include/asm/pgalloc.h
@@ -63,8 +63,7 @@ static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgd, pmd_t *pmd)
 
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
 {
-	pmd_t *pmd = (pmd_t *)__get_free_pages(GFP_KERNEL|__GFP_REPEAT,
-					       PMD_ORDER);
+	pmd_t *pmd = (pmd_t *)__get_free_pages(GFP_KERNEL, PMD_ORDER);
 	if (pmd)
 		memset(pmd, 0, PAGE_SIZE<<PMD_ORDER);
 	return pmd;
diff --git a/arch/score/include/asm/pgalloc.h b/arch/score/include/asm/pgalloc.h
index 2e067657db98..49b012d78c1a 100644
--- a/arch/score/include/asm/pgalloc.h
+++ b/arch/score/include/asm/pgalloc.h
@@ -42,8 +42,7 @@ static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
 {
 	pte_t *pte;
 
-	pte = (pte_t *) __get_free_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO,
-					PTE_ORDER);
+	pte = (pte_t *) __get_free_pages(GFP_KERNEL|__GFP_ZERO, PTE_ORDER);
 
 	return pte;
 }
@@ -53,7 +52,7 @@ static inline struct page *pte_alloc_one(struct mm_struct *mm,
 {
 	struct page *pte;
 
-	pte = alloc_pages(GFP_KERNEL | __GFP_REPEAT, PTE_ORDER);
+	pte = alloc_pages(GFP_KERNEL, PTE_ORDER);
 	if (!pte)
 		return NULL;
 	clear_highpage(pte);
diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index 4d38416e2a7f..04f89caef9c4 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -57,7 +57,7 @@
 # error "Need more than one PGD for the ESPFIX hack"
 #endif
 
-#define PGALLOC_GFP (GFP_KERNEL | __GFP_NOTRACK | __GFP_REPEAT | __GFP_ZERO)
+#define PGALLOC_GFP (GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO)
 
 /* This contains the *bottom* address of the espfix stack */
 DEFINE_PER_CPU_READ_MOSTLY(unsigned long, espfix_stack);
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index f52caf9c519b..bdbb3213f670 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -6,7 +6,7 @@
 #include <asm/fixmap.h>
 #include <asm/mtrr.h>
 
-#define PGALLOC_GFP GFP_KERNEL | __GFP_NOTRACK | __GFP_REPEAT | __GFP_ZERO
+#define PGALLOC_GFP GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO
 
 #ifdef CONFIG_HIGHPTE
 #define PGALLOC_USER_GFP __GFP_HIGHMEM
-- 
2.6.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 3/3] jbd2: get rid of superfluous __GFP_REPEAT
  2015-11-05 16:15 ` mhocko
@ 2015-11-05 16:16   ` mhocko
  -1 siblings, 0 replies; 39+ messages in thread
From: mhocko @ 2015-11-05 16:16 UTC (permalink / raw)
  To: linux-mm; +Cc: Andrew Morton, LKML, Michal Hocko, Theodore Ts'o

From: Michal Hocko <mhocko@suse.com>

jbd2_alloc is explicit about its allocation preferences wrt. the
allocation size. Sub-page allocations go to the slab allocator
and larger ones use either the page allocator or vmalloc. This
is all good but the logic is unnecessarily complex. Requests of
order-3 and larger use vmalloc directly while smaller ones go to the
page allocator with __GFP_REPEAT. The flag doesn't do anything useful
for them because they are smaller than PAGE_ALLOC_COSTLY_ORDER.

Let's simplify the code flow and use the slab allocator for sub-page
requests and the page allocator for the rest, with a fallback to
vmalloc if the allocation fails.

Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/jbd2/journal.c | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 81e622681c82..630abbfa4b61 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -2299,18 +2299,15 @@ void *jbd2_alloc(size_t size, gfp_t flags)
 
 	BUG_ON(size & (size-1)); /* Must be a power of 2 */
 
-	flags |= __GFP_REPEAT;
-	if (size == PAGE_SIZE)
-		ptr = (void *)__get_free_pages(flags, 0);
-	else if (size > PAGE_SIZE) {
+	if (size < PAGE_SIZE)
+		ptr = kmem_cache_alloc(get_slab(size), flags);
+	else {
 		int order = get_order(size);
 
-		if (order < 3)
-			ptr = (void *)__get_free_pages(flags, order);
-		else
+		ptr = (void *)__get_free_pages(flags, order);
+		if (!ptr)
 			ptr = vmalloc(size);
-	} else
-		ptr = kmem_cache_alloc(get_slab(size), flags);
+	}
 
 	/* Check alignment; SLUB has gotten this wrong in the past,
 	 * and this can lead to user data corruption! */
-- 
2.6.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH] jbd2: get rid of superfluous __GFP_REPEAT
  2015-11-05 16:16   ` mhocko
@ 2015-11-06 16:17     ` mhocko
  -1 siblings, 0 replies; 39+ messages in thread
From: mhocko @ 2015-11-06 16:17 UTC (permalink / raw)
  To: linux-mm; +Cc: Andrew Morton, Theodore Ts'o, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

jbd2_alloc is explicit about its allocation preferences wrt. the
allocation size. Sub-page allocations go to the slab allocator
and larger ones use either the page allocator or vmalloc. This
is all good but the logic is unnecessarily complex. Requests of
order-3 and larger use vmalloc directly while smaller ones go to the
page allocator with __GFP_REPEAT. The flag doesn't do anything useful
for them because they are smaller than PAGE_ALLOC_COSTLY_ORDER.

Let's simplify the code flow and use the slab allocator for sub-page
requests and the page allocator for the rest, with a fallback to
vmalloc if the allocation fails.

Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/jbd2/journal.c | 35 ++++++++++++-----------------------
 1 file changed, 12 insertions(+), 23 deletions(-)

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 81e622681c82..2945c96f171f 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -2299,18 +2299,15 @@ void *jbd2_alloc(size_t size, gfp_t flags)
 
 	BUG_ON(size & (size-1)); /* Must be a power of 2 */
 
-	flags |= __GFP_REPEAT;
-	if (size == PAGE_SIZE)
-		ptr = (void *)__get_free_pages(flags, 0);
-	else if (size > PAGE_SIZE) {
+	if (size < PAGE_SIZE)
+		ptr = kmem_cache_alloc(get_slab(size), flags);
+	else {
 		int order = get_order(size);
 
-		if (order < 3)
-			ptr = (void *)__get_free_pages(flags, order);
-		else
+		ptr = (void *)__get_free_pages(flags, order);
+		if (!ptr)
 			ptr = vmalloc(size);
-	} else
-		ptr = kmem_cache_alloc(get_slab(size), flags);
+	}
 
 	/* Check alignment; SLUB has gotten this wrong in the past,
 	 * and this can lead to user data corruption! */
@@ -2321,20 +2318,12 @@ void *jbd2_alloc(size_t size, gfp_t flags)
 
 void jbd2_free(void *ptr, size_t size)
 {
-	if (size == PAGE_SIZE) {
-		free_pages((unsigned long)ptr, 0);
-		return;
-	}
-	if (size > PAGE_SIZE) {
-		int order = get_order(size);
-
-		if (order < 3)
-			free_pages((unsigned long)ptr, order);
-		else
-			vfree(ptr);
-		return;
-	}
-	kmem_cache_free(get_slab(size), ptr);
+	if (size < PAGE_SIZE)
+		kmem_cache_free(get_slab(size), ptr);
+	else if (is_vmalloc_addr(ptr))
+		vfree(ptr);
+	else
+		free_pages((unsigned long)ptr, get_order(size));
 };
 
 /*
-- 
2.6.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH] jbd2: get rid of superfluous __GFP_REPEAT
  2015-11-06 16:17     ` mhocko
@ 2015-11-07  1:22       ` Tetsuo Handa
  -1 siblings, 0 replies; 39+ messages in thread
From: Tetsuo Handa @ 2015-11-07  1:22 UTC (permalink / raw)
  To: mhocko, linux-mm
  Cc: Andrew Morton, Theodore Ts'o, LKML, Michal Hocko, john.johansen

On 2015/11/07 1:17, mhocko@kernel.org wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> jbd2_alloc is explicit about its allocation preferences wrt. the
> allocation size. Sub-page allocations go to the slab allocator
> and larger ones use either the page allocator or vmalloc. This
> is all good but the logic is unnecessarily complex. Requests of
> order-3 and larger use vmalloc directly while smaller ones go to the
> page allocator with __GFP_REPEAT. The flag doesn't do anything useful
> for them because they are smaller than PAGE_ALLOC_COSTLY_ORDER.
>
> Let's simplify the code flow and use the slab allocator for sub-page
> requests and the page allocator for the rest, with a fallback to
> vmalloc if the allocation fails.
>
> Cc: "Theodore Ts'o" <tytso@mit.edu>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>   fs/jbd2/journal.c | 35 ++++++++++++-----------------------
>   1 file changed, 12 insertions(+), 23 deletions(-)
>
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index 81e622681c82..2945c96f171f 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -2299,18 +2299,15 @@ void *jbd2_alloc(size_t size, gfp_t flags)
>
>   	BUG_ON(size & (size-1)); /* Must be a power of 2 */
>
> -	flags |= __GFP_REPEAT;
> -	if (size == PAGE_SIZE)
> -		ptr = (void *)__get_free_pages(flags, 0);
> -	else if (size > PAGE_SIZE) {
> +	if (size < PAGE_SIZE)
> +		ptr = kmem_cache_alloc(get_slab(size), flags);
> +	else {
>   		int order = get_order(size);
>
> -		if (order < 3)
> -			ptr = (void *)__get_free_pages(flags, order);
> -		else
> +		ptr = (void *)__get_free_pages(flags, order);

I thought that we could add __GFP_NOWARN to this __get_free_pages() call.
But I noticed a more important problem. See below.

> +		if (!ptr)
>   			ptr = vmalloc(size);
> -	} else
> -		ptr = kmem_cache_alloc(get_slab(size), flags);
> +	}
>
>   	/* Check alignment; SLUB has gotten this wrong in the past,
>   	 * and this can lead to user data corruption! */
> @@ -2321,20 +2318,12 @@ void *jbd2_alloc(size_t size, gfp_t flags)
>
>   void jbd2_free(void *ptr, size_t size)
>   {
> -	if (size == PAGE_SIZE) {
> -		free_pages((unsigned long)ptr, 0);
> -		return;
> -	}
> -	if (size > PAGE_SIZE) {
> -		int order = get_order(size);
> -
> -		if (order < 3)
> -			free_pages((unsigned long)ptr, order);
> -		else
> -			vfree(ptr);
> -		return;
> -	}
> -	kmem_cache_free(get_slab(size), ptr);
> +	if (size < PAGE_SIZE)
> +		kmem_cache_free(get_slab(size), ptr);
> +	else if (is_vmalloc_addr(ptr))
> +		vfree(ptr);
> +	else
> +		free_pages((unsigned long)ptr, get_order(size));
>   };
>
>   /*
>

All jbd2_alloc() callers seem to pass GFP_NOFS. Therefore, use of
vmalloc() which implicitly passes GFP_KERNEL | __GFP_HIGHMEM can cause
deadlock, can't it? This vmalloc(size) call needs to be replaced with
__vmalloc(size, flags).
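
Something like this minimal sketch, assuming we want to keep the caller's
gfp flags (in this tree __vmalloc() takes the gfp mask plus a pgprot_t):

	if (!ptr)
		ptr = __vmalloc(size, flags | __GFP_HIGHMEM, PAGE_KERNEL);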

We need to check all vmalloc() callers in case they are calling vmalloc()
in a GFP_KERNEL-unsafe context. For example, I think that __aa_kvmalloc()
needs to use __vmalloc() too.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] jbd2: get rid of superfluous __GFP_REPEAT
  2015-11-07  1:22       ` Tetsuo Handa
@ 2015-11-08  5:08         ` Theodore Ts'o
  -1 siblings, 0 replies; 39+ messages in thread
From: Theodore Ts'o @ 2015-11-08  5:08 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, linux-mm, Andrew Morton, LKML, Michal Hocko, john.johansen

On Sat, Nov 07, 2015 at 10:22:55AM +0900, Tetsuo Handa wrote:
> All jbd2_alloc() callers seem to pass GFP_NOFS. Therefore, use of
> vmalloc() which implicitly passes GFP_KERNEL | __GFP_HIGHMEM can cause
> deadlock, can't it? This vmalloc(size) call needs to be replaced with
> __vmalloc(size, flags).

jbd2_alloc is only ever passed bh->b_size, which can't be >
PAGE_SIZE, so the code path that calls vmalloc() should never get
called.  When we converted jbd2_alloc() to support sub-page size
allocations in commit d2eecb039368, there was an assumption that it
could be called with a size greater than PAGE_SIZE, but that's
certainly not true today.
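
For reference, the main caller looks like this (paraphrased from
jbd2_journal_write_metadata_buffer() in fs/jbd2/journal.c):

	/* bh_in->b_size is the block size, which is at most PAGE_SIZE */
	tmp = jbd2_alloc(bh_in->b_size, GFP_NOFS);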

					- Ted

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] jbd2: get rid of superfluous __GFP_REPEAT
  2015-11-08  5:08         ` Theodore Ts'o
@ 2015-11-09  8:16           ` Michal Hocko
  -1 siblings, 0 replies; 39+ messages in thread
From: Michal Hocko @ 2015-11-09  8:16 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Tetsuo Handa, linux-mm, Andrew Morton, LKML, john.johansen

On Sun 08-11-15 00:08:02, Theodore Ts'o wrote:
> On Sat, Nov 07, 2015 at 10:22:55AM +0900, Tetsuo Handa wrote:
> > All jbd2_alloc() callers seem to pass GFP_NOFS. Therefore, use of
> > vmalloc() which implicitly passes GFP_KERNEL | __GFP_HIGHMEM can cause
> > deadlock, can't it? This vmalloc(size) call needs to be replaced with
> > __vmalloc(size, flags).
> 
> jbd2_alloc is only ever passed bh->b_size, which can't be >
> PAGE_SIZE, so the code path that calls vmalloc() should never get
> called.  When we converted jbd2_alloc() to support sub-page size
> allocations in commit d2eecb039368, there was an assumption that it
> could be called with a size greater than PAGE_SIZE, but that's
> certainly not true today.

Thanks for the clarification. Then the patch can be simplified even
further.
---
From fbf02c347dae8ee86e396bc769a88e85773db83e Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Wed, 21 Oct 2015 17:14:49 +0200
Subject: [PATCH] jbd2: get rid of superfluous __GFP_REPEAT

jbd2_alloc is explicit about its allocation preferences wrt. the
allocation size. Sub-page allocations go to the slab allocator
and larger ones use either the page allocator or vmalloc. This
is all good but the logic is unnecessarily complex.
1) as per Ted, the vmalloc fallback is a left-over:
: jbd2_alloc is only ever passed bh->b_size, which can't be >
: PAGE_SIZE, so the code path that calls vmalloc() should never get
: called.  When we converted jbd2_alloc() to support sub-page size
: allocations in commit d2eecb039368, there was an assumption that it
: could be called with a size greater than PAGE_SIZE, but that's
: certainly not true today.
Moreover the vmalloc allocation might even lead to a deadlock because
the callers expect a GFP_NOFS context while vmalloc uses GFP_KERNEL.
2) Requests smaller than order-3 go to the page allocator with
__GFP_REPEAT. The flag doesn't do anything useful for them because they
are smaller than PAGE_ALLOC_COSTLY_ORDER.

Let's simplify the code flow and use the slab allocator for sub-page
requests and the page allocator for the rest. Even though order > 0 is
not currently used, as per the above, leave that option open.

Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/jbd2/journal.c | 32 +++++++-------------------------
 1 file changed, 7 insertions(+), 25 deletions(-)

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 81e622681c82..0145e7978ab4 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -2299,18 +2299,10 @@ void *jbd2_alloc(size_t size, gfp_t flags)
 
 	BUG_ON(size & (size-1)); /* Must be a power of 2 */
 
-	flags |= __GFP_REPEAT;
-	if (size == PAGE_SIZE)
-		ptr = (void *)__get_free_pages(flags, 0);
-	else if (size > PAGE_SIZE) {
-		int order = get_order(size);
-
-		if (order < 3)
-			ptr = (void *)__get_free_pages(flags, order);
-		else
-			ptr = vmalloc(size);
-	} else
+	if (size < PAGE_SIZE)
 		ptr = kmem_cache_alloc(get_slab(size), flags);
+	else
+		ptr = (void *)__get_free_pages(flags, get_order(size));
 
 	/* Check alignment; SLUB has gotten this wrong in the past,
 	 * and this can lead to user data corruption! */
@@ -2321,20 +2313,10 @@ void *jbd2_alloc(size_t size, gfp_t flags)
 
 void jbd2_free(void *ptr, size_t size)
 {
-	if (size == PAGE_SIZE) {
-		free_pages((unsigned long)ptr, 0);
-		return;
-	}
-	if (size > PAGE_SIZE) {
-		int order = get_order(size);
-
-		if (order < 3)
-			free_pages((unsigned long)ptr, order);
-		else
-			vfree(ptr);
-		return;
-	}
-	kmem_cache_free(get_slab(size), ptr);
+	if (size < PAGE_SIZE)
+		kmem_cache_free(get_slab(size), ptr);
+	else
+		free_pages((unsigned long)ptr, get_order(size));
 };
 
 /*
-- 
2.6.2

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/3] tree wide: get rid of __GFP_REPEAT for order-0 allocations part I
  2015-11-05 16:15   ` mhocko
@ 2015-11-09 22:04     ` Vlastimil Babka
  -1 siblings, 0 replies; 39+ messages in thread
From: Vlastimil Babka @ 2015-11-09 22:04 UTC (permalink / raw)
  To: mhocko, linux-mm; +Cc: Andrew Morton, LKML, Michal Hocko

On 5.11.2015 17:15, mhocko@kernel.org wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> __GFP_REPEAT has a rather weak semantic but since it has been introduced
> around 2.6.12 it has been ignored for low order allocations. Yet we have
> the full kernel tree with its usage for apparently order-0 allocations.
> This is really confusing because __GFP_REPEAT is explicitly documented
> to allow allocation failures which is a weaker semantic than the current
> order-0 has (basically nofail).
> 
> Let's simply reap out __GFP_REPEAT from those places. This would allow
> to identify place which really need allocator to retry harder and
> formulate a more specific semantic for what the flag is supposed to do
> actually.

So at first I thought "yeah that's obvious", but then after some more thinking,
I'm not so sure anymore.

I think we should formulate the semantic first, then do any changes. Also, let's
look at the flag description (which comes from pre-git):

 * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
 * _might_ fail.  This depends upon the particular VM implementation.

So we say it's an implementation detail, and IIRC the same is said about which
orders are considered costly and which not, and the associated rules. So, can we
blame callers that happen to use __GFP_REPEAT essentially as a no-op in the
current implementation? And is it a problem that they do that?

So I think we should answer the following questions:

* What is the semantic of __GFP_REPEAT?
  - My suggestion would be something like "I would really like this allocation
to succeed. I still have some fallback but it's so suboptimal I'd rather wait
for this allocation." And then we could e.g. change some heuristics to take that
into account - e.g. direct compaction could ignore the deferred state and
pageblock skip bits, to make sure it's as thorough as possible. Right now, that
sort of happens, but not quite - given enough reclaim/compact attempts, the
compact attempts might break out of deferred state. But pages_reclaimed might
reach 1 << order before compaction "undefers", and then it breaks out of the
loop. Is any such heuristic change possible for reclaim as well?
As part of this question we should also keep in mind/rethink __GFP_NORETRY as
that's supposed to be the opposite flag to __GFP_REPEAT.

* Can it ever happen that __GFP_REPEAT could make some difference for order-0?
  - Certainly not wrt compaction, how about reclaim?
  - If it couldn't possibly affect order-0, then yeah proceed with Patch 1.

* Is PAGE_ALLOC_COSTLY_ORDER considered an implementation detail?
  - I would think so, and if yes, then we probably shouldn't remove
__GFP_NORETRY for order-1+ allocations that happen to be not costly in the
current implementation?





^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/3] tree wide: get rid of __GFP_REPEAT for order-0 allocations part I
  2015-11-09 22:04     ` Vlastimil Babka
@ 2015-11-10 12:51       ` Michal Hocko
  -1 siblings, 0 replies; 39+ messages in thread
From: Michal Hocko @ 2015-11-10 12:51 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: linux-mm, Andrew Morton, LKML

On Mon 09-11-15 23:04:15, Vlastimil Babka wrote:
> On 5.11.2015 17:15, mhocko@kernel.org wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > __GFP_REPEAT has a rather weak semantic but since it has been introduced
> > around 2.6.12 it has been ignored for low order allocations. Yet we have
> > the full kernel tree with its usage for apparently order-0 allocations.
> > This is really confusing because __GFP_REPEAT is explicitly documented
> > to allow allocation failures which is a weaker semantic than the current
> > order-0 has (basically nofail).
> > 
> > Let's simply reap out __GFP_REPEAT from those places. This would allow
> > to identify place which really need allocator to retry harder and
> > formulate a more specific semantic for what the flag is supposed to do
> > actually.
> 
> So at first I thought "yeah that's obvious", but then after some more thinking,
> I'm not so sure anymore.

Thanks for looking into this! The primary purpose of this patch series was
to start the discussion. I've only now realized I forgot to add RFC, sorry
about that.

> I think we should formulate the semantic first, then do any changes. Also, let's
> look at the flag description (which comes from pre-git):

It's rather hard to formulate one without examining the current users...

>  * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
>  * _might_ fail.  This depends upon the particular VM implementation.
> 
> So we say it's an implementation detail, and IIRC the same is said about which
> orders are considered costly and which not, and the associated rules. So, can we
> blame callers that happen to use __GFP_REPEAT essentially as a no-op in the
> current implementation? And is it a problem that they do that?

Well, I think that many users simply copy&pasted the code along with the
flag. I have failed to find any justification for adding this flag in
basically all the cases I've checked.

My understanding is that the overall motivation for the flag was to
fortify the allocation requests rather than weaken them. But if we were
literal then __GFP_REPEAT is in fact weaker than GFP_KERNEL for lower
orders. It is true that the latter is so only implicitly - and as an
implementation detail.

Anyway I think that getting rid of those users which couldn't ever see a
difference is a good start.

> So I think we should answer the following questions:
> 
> * What is the semantic of __GFP_REPEAT?
>   - My suggestion would be something like "I would really like this allocation
> to succeed. I still have some fallback but it's so suboptimal I'd rather wait
> for this allocation."

This is very close to the current wording.

> And then we could e.g. change some heuristics to take that
> into account - e.g. direct compaction could ignore the deferred state and
> pageblock skip bits, to make sure it's as thorough as possible. Right now, that
> sort of happens, but not quite - given enough reclaim/compact attempts, the
> compact attempts might break out of deferred state. But pages_reclaimed might
> reach 1 << order before compaction "undefers", and then it breaks out of the
> loop. Is any such heuristic change possible for reclaim as well?

I am not familiar enough with the compaction code to comment on this but
the reclaim part already has something in should_continue_reclaim (see
the paraphrase below). For low order allocations this doesn't make too
much of a difference because the reclaim is retried anyway from the page
allocator path.
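
If I remember correctly the check is roughly the following (a paraphrase
of should_continue_reclaim() in mm/vmscan.c, not a verbatim quote):

	if (sc->gfp_mask & __GFP_REPEAT) {
		/* scan the full LRU before giving up */
		if (!nr_reclaimed && !nr_scanned)
			return false;
	} else {
		/* return to the caller faster at the risk of failing */
		if (!nr_reclaimed)
			return false;
	}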

> As part of this question we should also keep in mind/rethink __GFP_NORETRY as
> that's supposed to be the opposite flag to __GFP_REPEAT.
> 
> * Can it ever happen that __GFP_REPEAT could make some difference for order-0?

Yes, if you want to try hard but eventually allow the request to fail.
Why not just use __GFP_NORETRY for that purpose? Well, that context is
much weaker. We give up after the first round of reclaim and we do not
trigger the OOM killer in that context. __GFP_REPEAT should be about
finite retrying, as sketched below.
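
Something along these lines - a hypothetical caller just to illustrate
the intended semantic, not an existing user:

	/* try hard - reclaim, even the OOM killer - but tolerate a failure */
	page = alloc_pages(GFP_KERNEL | __GFP_REPEAT, order);
	if (!page)
		return -ENOMEM;	/* report the failure instead of looping */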

I am pretty sure there are users who would like to have this semantic.
None of the current low-order users seem to fall into this category
AFAICS though.

>   - Certainly not wrt compaction, how about reclaim?

We can decide to allow the allocation to fail if the reclaim progress
was not sufficient _and_ the OOM killer hasn't helped rather than start
looping again.

>   - If it couldn't possibly affect order-0, then yeah proceed with Patch 1.

I've split the obviously order-0 cases from the rest because I think
order-0 is really easy to understand. Patch 1 is a bit harder to grasp
but I think it should be safe as well. I am open to discussion of course.

> * Is PAGE_ALLOC_COSTLY_ORDER considered an implementation detail?

Yes, more or less, but I doubt we can change it much considering all the
legacy code which might rely on it. I think we should simply remove the
dependency on the order and act with the same semantics for all orders.
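
For reference, the definition is a plain constant in
include/linux/mmzone.h:

	/* allocations above this order are deemed costly to service */
	#define PAGE_ALLOC_COSTLY_ORDER 3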

>   - I would think so, and if yes, then we probably shouldn't remove
> __GFP_NORETRY for order-1+ allocations that happen to be not costly in the
> current implementation?

I guess you meant __GFP_REPEAT here but even then we should really think
about the reason why the flag was added. Is it to fortify the request?
If so, it never worked that way, so it is hard to justify it on those
grounds.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/3] tree wide: get rid of __GFP_REPEAT for order-0 allocations part I
  2015-11-10 12:51       ` Michal Hocko
@ 2015-11-18 14:15         ` Vlastimil Babka
  -1 siblings, 0 replies; 39+ messages in thread
From: Vlastimil Babka @ 2015-11-18 14:15 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Andrew Morton, LKML

On 11/10/2015 01:51 PM, Michal Hocko wrote:
> On Mon 09-11-15 23:04:15, Vlastimil Babka wrote:
>> On 5.11.2015 17:15, mhocko@kernel.org wrote:
>> > From: Michal Hocko <mhocko@suse.com>
>> > 
>> > __GFP_REPEAT has rather weak semantics but since it was introduced
>> > around 2.6.12 it has been ignored for low order allocations. Yet the
>> > kernel tree is full of its usage for apparently order-0 allocations.
>> > This is really confusing because __GFP_REPEAT is explicitly documented
>> > to allow allocation failures, which is a weaker semantic than the current
>> > order-0 behavior (basically nofail).
>> > 
>> > Let's simply rip out __GFP_REPEAT from those places. This would allow
>> > us to identify places which really need the allocator to retry harder and
>> > formulate a more specific semantic for what the flag is actually
>> > supposed to do.
>> 
>> So at first I thought "yeah that's obvious", but then after some more thinking,
>> I'm not so sure anymore.
> 
> Thanks for looking into this! The primary purpose of this patch series was
> to start the discussion. I've only now realized I forgot to add RFC, sorry
> about that.
> 
>> I think we should formulate the semantic first, then do any changes. Also, let's
>> look at the flag description (which comes from pre-git):
> 
> It's rather hard to formulate one without examining the current users...

Sure, but changing existing users is a different thing :)

>>  * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
>>  * _might_ fail.  This depends upon the particular VM implementation.
>> 
>> So we say it's an implementation detail, and IIRC the same is said about which
>> orders are considered costly and which not, and the associated rules. So, can we
>> blame callers that happen to use __GFP_REPEAT essentially as a no-op in the
>> current implementation? And is it a problem that they do that?
> 
> Well, I think that many users simply copy&pasted the code along with the
> flag. I have failed to find any justification for adding this flag in
> basically all the cases I've checked.
> 
> My understanding is that the overall motivation for the flag was to
> fortify the allocation requests rather than weaken them. But if we read
> it literally, then __GFP_REPEAT is in fact weaker than GFP_KERNEL for lower
> orders. It is true that the latter is so only implicitly - and as an
> implementation detail.

OK, I admit I didn't fully realize that __GFP_REPEAT is supposed to be weaker,
although you did write it quite explicitly in the changelog. It's just
completely counterintuitive given the name of the flag!
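
For reference, the retry decision in the page allocator slowpath boils
down to roughly the following - a condensed sketch from memory, not
verbatim mm/page_alloc.c:

static int should_alloc_retry_sketch(gfp_t gfp_mask, unsigned int order,
				     unsigned long pages_reclaimed)
{
	if (gfp_mask & __GFP_NORETRY)
		return 0;	/* caller opted out of looping */
	if (gfp_mask & __GFP_NOFAIL)
		return 1;	/* caller must never see a failure */

	/* non-costly orders keep retrying - the implicit "nofail" */
	if (order <= PAGE_ALLOC_COSTLY_ORDER)
		return 1;

	/*
	 * Only here does __GFP_REPEAT make a difference: a costly order
	 * keeps retrying until at least 1 << order pages have been
	 * reclaimed, and only then is the allocation allowed to fail.
	 */
	if ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))
		return 1;

	return 0;
}

So below PAGE_ALLOC_COSTLY_ORDER the flag is dead weight and the request
effectively cannot fail anyway.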

> Anyway I think that getting rid of those users which couldn't ever see a
> difference is a good start.
> 
>> So I think we should answer the following questions:
>> 
>> * What is the semantic of __GFP_REPEAT?
>>   - My suggestion would be something like "I would really like this allocation
>> to succeed. I still have some fallback but it's so suboptimal I'd rather wait
>> for this allocation."
> 
> This is very close to the current wording.
> 
>> And then we could e.g. change some heuristics to take that
>> into account - e.g. direct compaction could ignore the deferred state and
>> pageblock skip bits, to make sure it's as thorough as possible. Right now, that
>> sort of happens, but not quite - given enough reclaim/compact attempts, the
>> compact attempts might break out of deferred state. But pages_reclaimed might
>> reach 1 << order before compaction "undefers", and then it breaks out of the
>> loop. Is any such heuristic change possible for reclaim as well?
> 
> I am not familiar enough with the compaction code to comment on this but
> the reclaim part already has something in should_continue_reclaim.

Yeah, in this function having the flag means trying harder, which is
counterintuitive to the "allow failure" semantics.
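
Roughly this, paraphrasing the check in mm/vmscan.c (not verbatim):

	if (sc->gfp_mask & __GFP_REPEAT) {
		/* try harder: only give up once a scan of the full LRU
		 * made no progress at all */
		if (!nr_reclaimed && !nr_scanned)
			return false;
	} else {
		/* give up as soon as the last batch reclaimed nothing */
		if (!nr_reclaimed)
			return false;
	}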

> For low order allocations this doesn't make too much of a difference
> because the reclaim is retried anyway from the page allocator path.
> 
>> As part of this question we should also keep in mind/rethink __GFP_NORETRY as
>> that's supposed to be the opposite flag to __GFP_REPEAT.
>> 
>> * Can it ever happen that __GFP_REPEAT could make some difference for order-0?
> 
> Yes, if you want to try hard but eventually allow the request to fail.

Right, I missed the "eventually fail" part.

> Why not just use __GFP_NORETRY for that purpose? Well, that context is
> much weaker. We give up after the first round of reclaim and we do
> not trigger the OOM killer in that context. __GFP_REPEAT should be about
> finite retrying.
> 
> I am pretty sure there are users who would like to have this semantic.
> None of the current low-order users seem to fall into this category
> AFAICS though.
> 
>>   - Certainly not wrt compaction, how about reclaim?
> 
> We can decide to allow the allocation to fail if reclaim progress was
> not sufficient _and_ the OOM killer hasn't helped, rather than start
> looping again.

Makes sense.

>>   - If it couldn't possibly affect order-0, then yeah proceed with Patch 1.
> 
> I've split the obviously order-0 cases from the rest because I think they
> are really easy to understand. Patch 1 is a bit harder to grasp but I
> think it should be safe as well. I am open to discussion of course.
> 
>> * Is PAGE_ALLOC_COSTLY_ORDER considered an implementation detail?
> 
> Yes, more or less, but I doubt we can change it much considering all the
> legacy code which might rely on it. I think we should simply remove the
> dependency on the order and act semantically the same for all orders.

That's a nice goal, but probably a long-term one? Making the possibility of
failure the default doesn't look viable, so that leaves us with making
non-failure the default, which means checking all currently-costly-order
allocating sites and making them pass one of the flags allowing failure.

>>   - I would think so, and if yes, then we probably shouldn't remove
>> __GFP_NORETRY for order-1+ allocations that happen to be not costly in the
>> current implementation?
> 
> I guess you meant __GFP_REPEAT here

Ah, yes.

> but even then we should really think
> about the reason why the flag was added. Was it to fortify the
> request? If so, it has never worked that way, so it is hard to justify it
> on those grounds.

Yep, we should definitely think of a less misleading name.

> Thanks!
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] jbd2: get rid of superfluous __GFP_REPEAT
  2015-11-09  8:16           ` Michal Hocko
@ 2015-11-26 15:10             ` Michal Hocko
  -1 siblings, 0 replies; 39+ messages in thread
From: Michal Hocko @ 2015-11-26 15:10 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Tetsuo Handa, linux-mm, Andrew Morton, LKML, john.johansen

Hi Ted,
are there any objections to the patch or should I just repost it?

On Mon 09-11-15 09:16:50, Michal Hocko wrote:
> On Sun 08-11-15 00:08:02, Theodore Ts'o wrote:
> > On Sat, Nov 07, 2015 at 10:22:55AM +0900, Tetsuo Handa wrote:
> > > All jbd2_alloc() callers seem to pass GFP_NOFS. Therefore, use of
> > > vmalloc() which implicitly passes GFP_KERNEL | __GFP_HIGHMEM can cause
> > > deadlock, can't it? This vmalloc(size) call needs to be replaced with
> > > __vmalloc(size, flags).
> > 
> > jbd2_alloc is only passed in the bh->b_size, which can't be >
> > PAGE_SIZE, so the code path that calls vmalloc() should never get
> > called.  When we converted jbd2_alloc() to support sub-page size
> > allocations in commit d2eecb039368, there was an assumption that it
> > could be called with a size greater than PAGE_SIZE, but that's
> > certainly not true today.
> 
> Thanks for the clarification. Then the patch can be simplified even
> more.
> ---
> From fbf02c347dae8ee86e396bc769a88e85773db83e Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Wed, 21 Oct 2015 17:14:49 +0200
> Subject: [PATCH] jbd2: get rid of superfluous __GFP_REPEAT
> 
> jbd2_alloc is explicit about its allocation preferences wrt. the
> allocation size. Sub-page allocations go to the slab allocator
> and larger ones use either the page allocator or vmalloc. This
> is all good but the logic is unnecessarily complex.
> 1) as per Ted, the vmalloc fallback is a left-over:
> : jbd2_alloc is only passed in the bh->b_size, which can't be >
> : PAGE_SIZE, so the code path that calls vmalloc() should never get
> : called.  When we converted jbd2_alloc() to support sub-page size
> : allocations in commit d2eecb039368, there was an assumption that it
> : could be called with a size greater than PAGE_SIZE, but that's
> : certainly not true today.
> Moreover the vmalloc allocation might even lead to a deadlock because
> the callers expect GFP_NOFS context while vmalloc uses GFP_KERNEL.
> 2) Requests smaller than order-3 go to the page allocator with
> __GFP_REPEAT. The flag doesn't do anything useful for those because they
> are smaller than PAGE_ALLOC_COSTLY_ORDER.
> 
> Let's simplify the code flow and use the slab allocator for sub-page
> requests and the page allocator for others. Even though order > 0 is
> not currently used, as per the above, leave that option open.
> 
> Cc: "Theodore Ts'o" <tytso@mit.edu>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  fs/jbd2/journal.c | 32 +++++++-------------------------
>  1 file changed, 7 insertions(+), 25 deletions(-)
> 
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index 81e622681c82..0145e7978ab4 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -2299,18 +2299,10 @@ void *jbd2_alloc(size_t size, gfp_t flags)
>  
>  	BUG_ON(size & (size-1)); /* Must be a power of 2 */
>  
> -	flags |= __GFP_REPEAT;
> -	if (size == PAGE_SIZE)
> -		ptr = (void *)__get_free_pages(flags, 0);
> -	else if (size > PAGE_SIZE) {
> -		int order = get_order(size);
> -
> -		if (order < 3)
> -			ptr = (void *)__get_free_pages(flags, order);
> -		else
> -			ptr = vmalloc(size);
> -	} else
> +	if (size < PAGE_SIZE)
>  		ptr = kmem_cache_alloc(get_slab(size), flags);
> +	else
> +		ptr = (void *)__get_free_pages(flags, get_order(size));
>  
>  	/* Check alignment; SLUB has gotten this wrong in the past,
>  	 * and this can lead to user data corruption! */
> @@ -2321,20 +2313,10 @@ void *jbd2_alloc(size_t size, gfp_t flags)
>  
>  void jbd2_free(void *ptr, size_t size)
>  {
> -	if (size == PAGE_SIZE) {
> -		free_pages((unsigned long)ptr, 0);
> -		return;
> -	}
> -	if (size > PAGE_SIZE) {
> -		int order = get_order(size);
> -
> -		if (order < 3)
> -			free_pages((unsigned long)ptr, order);
> -		else
> -			vfree(ptr);
> -		return;
> -	}
> -	kmem_cache_free(get_slab(size), ptr);
> +	if (size < PAGE_SIZE)
> +		kmem_cache_free(get_slab(size), ptr);
> +	else
> +		free_pages((unsigned long)ptr, get_order(size));
>  };
>  
>  /*
> -- 
> 2.6.2
> 
> -- 
> Michal Hocko
> SUSE Labs

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] jbd2: get rid of superfluous __GFP_REPEAT
  2015-11-26 15:10             ` Michal Hocko
@ 2015-11-26 20:18               ` Theodore Ts'o
  -1 siblings, 0 replies; 39+ messages in thread
From: Theodore Ts'o @ 2015-11-26 20:18 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Tetsuo Handa, linux-mm, Andrew Morton, LKML, john.johansen

On Thu, Nov 26, 2015 at 04:10:17PM +0100, Michal Hocko wrote:
> Hi Ted,
> are there any objections to the patch or should I just repost it?

Sorry, when you first posted it I was crazy busy before going on a two
week vacation/pilgrimage (and in fact I'm still in Jerusalem, but I'm
going to be heading home soon).  I am starting to work on some
patches, and expect to have time to do more work on the airplane.  The
patches are queued up, and so I haven't lost them.  So feel free to
repost them if you want, but it's not necessary unless you have some
fixup to the patch.

Cheers,

							- Ted

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] jbd2: get rid of superfluous __GFP_REPEAT
  2015-11-26 20:18               ` Theodore Ts'o
@ 2015-11-27  7:56                 ` Michal Hocko
  -1 siblings, 0 replies; 39+ messages in thread
From: Michal Hocko @ 2015-11-27  7:56 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Tetsuo Handa, linux-mm, Andrew Morton, LKML, john.johansen

On Thu 26-11-15 15:18:17, Theodore Ts'o wrote:
> On Thu, Nov 26, 2015 at 04:10:17PM +0100, Michal Hocko wrote:
> > Hi Ted,
> > are there any objections to the patch or should I just repost it?
> 
> Sorry, when you first posted it I was crazy busy before going on a two
> week vacation/pilgrimage (and in fact I'm still in Jerusalem, but I'm
> going to be heading home soon).  I am starting to work on some
> patches, and expect to have time to do more work on the airplane.  The
> patches are queued up, and so I haven't lost them.  So feel free to
> repost them if you want, but it's not necessary unless you have some
> fixup to the patch.

Sure, no worries, this is nothing urgent, I just wanted to make sure it
doesn't fall through the cracks. If you have the latest one I was
replying to, then everything should be ok.

Thanks
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/3] tree wide: get rid of __GFP_REPEAT for order-0 allocations part I
  2015-11-18 14:15         ` Vlastimil Babka
@ 2015-11-27  9:38           ` Michal Hocko
  -1 siblings, 0 replies; 39+ messages in thread
From: Michal Hocko @ 2015-11-27  9:38 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: linux-mm, Andrew Morton, LKML

On Wed 18-11-15 15:15:29, Vlastimil Babka wrote:
> On 11/10/2015 01:51 PM, Michal Hocko wrote:
> > On Mon 09-11-15 23:04:15, Vlastimil Babka wrote:
> >> On 5.11.2015 17:15, mhocko@kernel.org wrote:
> >> > From: Michal Hocko <mhocko@suse.com>
> >> > 
> >> > __GFP_REPEAT has rather weak semantics but since it was introduced
> >> > around 2.6.12 it has been ignored for low order allocations. Yet the
> >> > kernel tree is full of its usage for apparently order-0 allocations.
> >> > This is really confusing because __GFP_REPEAT is explicitly documented
> >> > to allow allocation failures, which is a weaker semantic than the current
> >> > order-0 behavior (basically nofail).
> >> > 
> >> > Let's simply rip out __GFP_REPEAT from those places. This would allow
> >> > us to identify places which really need the allocator to retry harder and
> >> > formulate a more specific semantic for what the flag is actually
> >> > supposed to do.
> >> 
> >> So at first I thought "yeah that's obvious", but then after some more thinking,
> >> I'm not so sure anymore.
> > 
> > Thanks for looking into this! The primary purpose of this patch series was
> > to start the discussion. I've only now realized I forgot to add RFC, sorry
> > about that.
> > 
> >> I think we should formulate the semantic first, then do any changes. Also, let's
> >> look at the flag description (which comes from pre-git):
> > 
> > It's rather hard to formulate one without examining the current users...
> 
> Sure, but changing existing users is a different thing :)

Chicken & Egg I guess?

> >>  * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
> >>  * _might_ fail.  This depends upon the particular VM implementation.
> >> 
> >> So we say it's an implementation detail, and IIRC the same is said about which
> >> orders are considered costly and which not, and the associated rules. So, can we
> >> blame callers that happen to use __GFP_REPEAT essentially as a no-op in the
> >> current implementation? And is it a problem that they do that?
> > 
> > Well, I think that many users simply copy&pasted the code along with the
> > flag. I have failed to find any justification for adding this flag in
> > basically all the cases I've checked.
> > 
> > My understanding is that the overall motivation for the flag was to
> > fortify the allocation requests rather than weaken them. But if we read
> > it literally, then __GFP_REPEAT is in fact weaker than GFP_KERNEL for lower
> > orders. It is true that the latter is so only implicitly - and as an
> > implementation detail.
> 
> OK, I admit I didn't fully realize that __GFP_REPEAT is supposed to be weaker,
> although you did write it quite explicitly in the changelog. It's just
> completely counterintuitive given the name of the flag!

Yeah, I guess this is basically because the flag has always been about
costly allocations.

[...]

I am not sure whether we reached any conclusion here. Are there any strong
arguments against patch 1? I think that should be relatively
non-controversial. What about patch 2? I think it should be ok as well,
as we are basically removing a flag which has never had any effect.

I would like to proceed with this further by going through the remaining users.
Most of them depend on a variable size and I am not familiar with the
code, so I will talk to the maintainers to find out the reasoning behind
using the flag. Once we have covered a reasonable number of them I would like
to go on and rename the flag to __GFP_BEST_AFFORD and make it independent of the
order. It would still trigger the OOM killer where applicable but wouldn't
retry endlessly.

Does this sound like a reasonable plan?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/3] tree wide: get rid of __GFP_REPEAT for order-0 allocations part I
  2015-11-27  9:38           ` Michal Hocko
@ 2015-11-28 10:08             ` Michal Hocko
  -1 siblings, 0 replies; 39+ messages in thread
From: Michal Hocko @ 2015-11-28 10:08 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: linux-mm, Andrew Morton, LKML

On Fri 27-11-15 10:38:07, Michal Hocko wrote:
[...]
> I am not sure whether we reached any conclusion here. Are there any strong
> arguments against patch 1? I think that should be relatively
> non-controversial. What about patch 2? I think it should be ok as well,
> as we are basically removing a flag which has never had any effect.
> 
> I would like to proceed with this further by going through the remaining users.
> Most of them depend on a variable size and I am not familiar with the
> code, so I will talk to the maintainers to find out the reasoning behind
> using the flag. Once we have covered a reasonable number of them I would like
> to go on and rename the flag to __GFP_BEST_AFFORD and make it independent of the

ble, __GFP_BEST_EFFORT I meant of course...

> order. It would still trigger the OOM killer where applicable but wouldn't
> retry endlessly.
> 
> Does this sound like a reasonable plan?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/3] tree wide: get rid of __GFP_REPEAT for order-0 allocations part I
  2015-11-27  9:38           ` Michal Hocko
@ 2015-11-30 17:02             ` Vlastimil Babka
  -1 siblings, 0 replies; 39+ messages in thread
From: Vlastimil Babka @ 2015-11-30 17:02 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Andrew Morton, LKML

On 11/27/2015 10:38 AM, Michal Hocko wrote:
> On Wed 18-11-15 15:15:29, Vlastimil Babka wrote:
>
> I am not sure whether we reached any conclusion here. Are there any strong
> arguments against patch 1? I think that should be relatively
> non-controversial.

Agreed.

> What about patch 2? I think it should be ok as well,
> as we are basically removing a flag which has never had any effect.

Right.

> I would like to proceed with this further by going through the remaining users.
> Most of them depend on a variable size and I am not familiar with the
> code, so I will talk to the maintainers to find out the reasoning behind
> using the flag. Once we have covered a reasonable number of them I would like
> to go on and rename the flag to __GFP_BEST_AFFORD and make it independent of the
> order. It would still trigger the OOM killer where applicable but wouldn't
> retry endlessly.
>
> Does this sound like a reasonable plan?

I think we should consider all the related flags together before we
start renaming them. So IIUC the current state is:

~__GFP_DIRECT_RECLAIM - no reclaim/compaction, fails regardless of 
order; good for allocations that prefer their fallback to the latency of 
reclaim/compaction

__GFP_NORETRY - only one reclaim and two compaction attempts, then fails 
regardless of order; some tradeoff between allocation latency and fallback?

__GFP_REPEAT - for costly orders, tries harder to reclaim before oom, 
otherwise no difference - doesn't fail for non-costly orders, although 
comment says it could.

__GFP_NOFAIL - cannot fail

So the issue I see with simply renaming __GFP_REPEAT to 
__GFP_BEST_AFFORD and making it possible to fail for low orders, is that 
it will conflate the new failure possibility with the existing "try 
harder to reclaim before oom". As I mentioned before, "trying harder" 
could be also extended to mean something for compaction, but that would 
further muddle the meaning of the flag. Maybe the cleanest solution 
would be to have separate flags for "possible to fail" (let's say 
__GFP_MAYFAIL for now) and "try harder" (e.g. __GFP_TRY_HARDER)? And 
introduce two new higher-level "flags" of a GFP_* kind, that callers 
would use instead of GFP_KERNEL, where one would mean 
GFP_KERNEL|__GFP_MAYFAIL and the other 
GFP_KERNEL|__GFP_TRY_HARDER|__GFP_MAYFAIL.
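
Concretely I imagine something like the following - the flag names and
bit values are purely illustrative, none of this exists in the tree:

#define ___GFP_MAYFAIL		0x1000000u	/* hypothetical bit */
#define ___GFP_TRY_HARDER	0x2000000u	/* hypothetical bit */
#define __GFP_MAYFAIL		((__force gfp_t)___GFP_MAYFAIL)
#define __GFP_TRY_HARDER	((__force gfp_t)___GFP_TRY_HARDER)

/* what callers would use instead of plain GFP_KERNEL */
#define GFP_KERNEL_MAYFAIL	(GFP_KERNEL | __GFP_MAYFAIL)
#define GFP_KERNEL_TRY_HARD	(GFP_KERNEL | __GFP_TRY_HARDER | __GFP_MAYFAIL)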

The second thing to consider: is __GFP_NORETRY useful? The latency
savings are quite vague. Maybe we could just remove this flag to make 
space for __GFP_MAYFAIL?

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/3] tree wide: get rid of __GFP_REPEAT for order-0 allocations part I
  2015-11-30 17:02             ` Vlastimil Babka
@ 2015-12-01 16:27               ` Michal Hocko
  -1 siblings, 0 replies; 39+ messages in thread
From: Michal Hocko @ 2015-12-01 16:27 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: linux-mm, Andrew Morton, LKML

On Mon 30-11-15 18:02:33, Vlastimil Babka wrote:
[...]
> I think we should consider all the related flags together before we start
> renaming them. So IIUC the current state is:
> 
> ~__GFP_DIRECT_RECLAIM - no reclaim/compaction, fails regardless of order;
> good for allocations that prefer their fallback to the latency of
> reclaim/compaction
> 
> __GFP_NORETRY - only one reclaim and two compaction attempts, then fails
> regardless of order; some tradeoff between allocation latency and fallback?

It also doesn't invoke the OOM killer.

> __GFP_REPEAT - for costly orders, tries harder to reclaim before oom,
> otherwise no difference - doesn't fail for non-costly orders, although
> comment says it could.
> 
> __GFP_NOFAIL - cannot fail
> 
> So the issue I see with simply renaming __GFP_REPEAT to __GFP_BEST_AFFORD
> and making it possible to fail for low orders, is that it will conflate the
> new failure possibility with the existing "try harder to reclaim before
> oom". As I mentioned before, "trying harder" could be also extended to mean
> something for compaction, but that would further muddle the meaning of the
> flag. Maybe the cleanest solution would be to have separate flags for
> "possible to fail" (let's say __GFP_MAYFAIL for now) and "try harder" (e.g.
> __GFP_TRY_HARDER)? And introduce two new higher-level "flags" of a GFP_*
> kind, that callers would use instead of GFP_KERNEL, where one would mean
> GFP_KERNEL|__GFP_MAYFAIL and the other
> GFP_KERNEL|__GFP_TRY_HARDER|__GFP_MAYFAIL.

I will think about that but this sounds quite confusing to me. All the
allocations on behalf of a user process are MAYFAIL basically (e.g. the
oom victim failure case) unless they are explicitly __GFP_NOFAIL. It
also sounds like ~__GFP_NOFAIL should imply MAYFAIL automatically.
__GFP_BEST_EFFORT, on the other hand, clearly states that the allocator
should try its best but it can fail. How it achieves that is an
implementation detail and users do not have to care. In your above
hierarchy of QoS we have:
- no reclaim ~__GFP_DIRECT_RECLAIM - optimistic allocation with a
  fallback (e.g. smaller allocation request)
- no destructive reclaim __GFP_NORETRY - allocation with a more
  expensive fallback (e.g. vmalloc)
- all reclaim types but only fail if there is no good hope of success
  __GFP_BEST_EFFORT (fail rather than invoke the OOM killer a second time);
  e.g. user allocations
- no failure allowed __GFP_NOFAIL - failure mode is not acceptable

we can keep the current implicit "low order implies __GFP_NOFAIL" behavior
of GFP_KERNEL and still let users use __GFP_BEST_EFFORT as a
way to override it.
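
By the oom victim failure case above I mean the existing escape hatch in
the slowpath, roughly (sketched from memory, not verbatim):

	/* an OOM victim may fail even a non-costly request so that it
	 * can exit quickly rather than loop for memory it is freeing */
	if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
		goto nopage;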

> The second thing to consider: is __GFP_NORETRY useful? The latency savings
> are quite vague. Maybe we could just remove this flag to make space for
> __GFP_MAYFAIL?

There are users who would like to see some reclaim but would rather fail
than see the OOM killer. I assume there are also users who can handle the
failure but for whom the OOM killer is not a big deal. I think that
GFP_USER is an example of the latter.
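
Such users typically follow the optimistic attempt plus fallback idiom,
something like this caller-side sketch (alloc_buf is a made-up name):

static void *alloc_buf(size_t size)
{
	void *p;

	/*
	 * Optimistic attempt at a physically contiguous buffer: allow
	 * some reclaim but fail fast and do not trigger the OOM killer.
	 */
	p = kmalloc(size, GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN);
	if (!p)
		/* more expensive, virtually contiguous fallback */
		p = vmalloc(size);
	return p;
}

/* the matching free side has to use is_vmalloc_addr() to choose
 * between kfree() and vfree() */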

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/3] tree wide: get rid of __GFP_REPEAT for order-0 allocations part I
  2015-12-01 16:27               ` Michal Hocko
@ 2015-12-21 12:18                 ` Vlastimil Babka
  -1 siblings, 0 replies; 39+ messages in thread
From: Vlastimil Babka @ 2015-12-21 12:18 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Andrew Morton, LKML

On 12/01/2015 05:27 PM, Michal Hocko wrote:
> On Mon 30-11-15 18:02:33, Vlastimil Babka wrote:
> [...]
>> So the issue I see with simply renaming __GFP_REPEAT to __GFP_BEST_AFFORD
>> and making it possible to fail for low orders, is that it will conflate the
>> new failure possibility with the existing "try harder to reclaim before
>> oom". As I mentioned before, "trying harder" could be also extended to mean
>> something for compaction, but that would further muddle the meaning of the
>> flag. Maybe the cleanest solution would be to have separate flags for
>> "possible to fail" (let's say __GFP_MAYFAIL for now) and "try harder" (e.g.
>> __GFP_TRY_HARDER)? And introduce two new higher-level "flags" of a GFP_*
>> kind, that callers would use instead of GFP_KERNEL, where one would mean
>> GFP_KERNEL|__GFP_MAYFAIL and the other
>> GFP_KERNEL|__GFP_TRY_HARDER|__GFP_MAYFAIL.
>
> I will think about that but this sounds quite confusing to me. All the
> allocations on behalf of a user process are MAYFAIL basically (e.g. the
> oom victim failure case) unless they are explicitly __GFP_NOFAIL. It
> also sounds like ~__GFP_NOFAIL should imply MAYFAIL automatically.
> __GFP_BEST_EFFORT, on the other hand, clearly states that the allocator
> should try its best but it can fail. How it achieves that is an
> implementation detail and users do not have to care. In your above
> hierarchy of QoS we have:
> - no reclaim ~__GFP_DIRECT_RECLAIM - optimistic allocation with a
>    fallback (e.g. smaller allocation request)
> - no destructive reclaim __GFP_NORETRY - allocation with a more
>    expensive fallback (e.g. vmalloc)

Maybe it would be less confusing / more consistent if __GFP_NORETRY was
renamed to __GFP_LOW_EFFORT?

> - all reclaim types but only fail if there is no good hope of success
>    __GFP_BEST_EFFORT (fail rather than invoke the OOM killer a second time);
>    e.g. user allocations
> - no failure allowed __GFP_NOFAIL - failure mode is not acceptable
>
> we can keep the current implicit "low order imply __GFP_NOFAIL" behavior
> of the GFP_KERNEL and still offer users to use __GFP_BEST_EFFORT as a
> way to override it.
>
>> The second thing to consider: is __GFP_NORETRY useful? The latency savings
>> are quite vague. Maybe we could just remove this flag to make space for
>> __GFP_MAYFAIL?
>
> There are users who would like to see some reclaim but would rather fail
> than see the OOM killer. I assume there are also users who can handle
> the failure but for whom the OOM killer is not a big deal. I think that
> GFP_USER is an example of the latter.
>
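
For completeness, a rough sketch of how the split proposed above could
look in gfp.h - nothing below exists in the tree; the bit values and the
GFP_KERNEL_* names are placeholders invented for illustration:

	/* hypothetical low-level flags; bit values are placeholders */
	#define ___GFP_MAYFAIL		0x1000000u
	#define ___GFP_TRY_HARDER	0x2000000u
	#define __GFP_MAYFAIL		((__force gfp_t)___GFP_MAYFAIL)
	#define __GFP_TRY_HARDER	((__force gfp_t)___GFP_TRY_HARDER)

	/* hypothetical higher-level combinations used instead of GFP_KERNEL */
	#define GFP_KERNEL_MAYFAIL	(GFP_KERNEL | __GFP_MAYFAIL)
	#define GFP_KERNEL_TRY_HARD	(GFP_KERNEL | __GFP_TRY_HARDER | __GFP_MAYFAIL)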


^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread

Thread overview: 39+ messages
2015-11-05 16:15 [PATCH 0/3] __GFP_REPEAT cleanup mhocko
2015-11-05 16:15 ` mhocko
2015-11-05 16:15 ` [PATCH 1/3] tree wide: get rid of __GFP_REPEAT for order-0 allocations part I mhocko
2015-11-05 16:15   ` mhocko
2015-11-09 22:04   ` Vlastimil Babka
2015-11-09 22:04     ` Vlastimil Babka
2015-11-10 12:51     ` Michal Hocko
2015-11-10 12:51       ` Michal Hocko
2015-11-18 14:15       ` Vlastimil Babka
2015-11-18 14:15         ` Vlastimil Babka
2015-11-27  9:38         ` Michal Hocko
2015-11-27  9:38           ` Michal Hocko
2015-11-28 10:08           ` Michal Hocko
2015-11-28 10:08             ` Michal Hocko
2015-11-30 17:02           ` Vlastimil Babka
2015-11-30 17:02             ` Vlastimil Babka
2015-12-01 16:27             ` Michal Hocko
2015-12-01 16:27               ` Michal Hocko
2015-12-21 12:18               ` Vlastimil Babka
2015-12-21 12:18                 ` Vlastimil Babka
2015-11-05 16:15 ` [PATCH 2/3] tree wide: get rid of __GFP_REPEAT for small order requests mhocko
2015-11-05 16:15   ` mhocko
2015-11-05 16:15   ` mhocko
2015-11-05 16:16 ` [PATCH 3/3] jbd2: get rid of superfluous __GFP_REPEAT mhocko
2015-11-05 16:16   ` mhocko
2015-11-06 16:17   ` [PATCH] " mhocko
2015-11-06 16:17     ` mhocko
2015-11-07  1:22     ` Tetsuo Handa
2015-11-07  1:22       ` Tetsuo Handa
2015-11-08  5:08       ` Theodore Ts'o
2015-11-08  5:08         ` Theodore Ts'o
2015-11-09  8:16         ` Michal Hocko
2015-11-09  8:16           ` Michal Hocko
2015-11-26 15:10           ` Michal Hocko
2015-11-26 15:10             ` Michal Hocko
2015-11-26 20:18             ` Theodore Ts'o
2015-11-26 20:18               ` Theodore Ts'o
2015-11-27  7:56               ` Michal Hocko
2015-11-27  7:56                 ` Michal Hocko
