All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/3] Some fixes for mm code optimization
@ 2018-05-04  6:52 Huaisheng Ye
       [not found] ` <1525416729-108201-4-git-send-email-yehs1@lenovo.com>
       [not found] ` <1525416729-108201-3-git-send-email-yehs1@lenovo.com>
  0 siblings, 2 replies; 20+ messages in thread
From: Huaisheng Ye @ 2018-05-04  6:52 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: mhocko, vbabka, mgorman, pasha.tatashin, alexander.levin, hannes,
	penguin-kernel, colyli, chengnt, linux-kernel, Huaisheng Ye

These patches try to optimize existing code of mm.

First patch, removes the useless parameter order of function
finalise_ac. This function is just used by __alloc_pages_nodemask
in mm/page_alloc.c.

Second patch, modifies the local variable bit's type to unsigned int
in function gfp_zone.

Third patch, fixes a typo in debug message to avoid confusion.

All patches have been tested on Lenovo Purley product.


Huaisheng Ye (3):
  mm/page_alloc: Remove useless parameter of finalise_ac
  include/linux/gfp.h: use unsigned int in gfp_zone
  mm/page_alloc: Fix typo in debug info of calculate_node_totalpages

 include/linux/gfp.h | 2 +-
 mm/page_alloc.c     | 7 +++----
 2 files changed, 4 insertions(+), 5 deletions(-)

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/3] mm/page_alloc: Fix typo in debug info of calculate_node_totalpages
       [not found] ` <1525416729-108201-4-git-send-email-yehs1@lenovo.com>
@ 2018-05-04 13:18   ` Michal Hocko
  2018-05-05  2:10     ` [External] " Huaisheng HS1 Ye
  0 siblings, 1 reply; 20+ messages in thread
From: Michal Hocko @ 2018-05-04 13:18 UTC (permalink / raw)
  To: Huaisheng Ye
  Cc: akpm, linux-mm, vbabka, mgorman, pasha.tatashin, alexander.levin,
	hannes, penguin-kernel, colyli, chengnt, linux-kernel

On Fri 04-05-18 14:52:09, Huaisheng Ye wrote:
> realtotalpages is calculated by taking off absent_pages from
> spanned_pages in every zone.
> Debug message of calculate_node_totalpages shall accurately
> indicate that it is real totalpages to avoid ambiguity.

Is the printk actually useful? Why don't we simply remove it? You can
get the information from /proc/zoneinfo so why to litter the dmesg
output?

> Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
> ---
>  mm/page_alloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 1b39db4..9d57db2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5967,7 +5967,7 @@ static void __meminit calculate_node_totalpages(struct pglist_data *pgdat,
>  
>  	pgdat->node_spanned_pages = totalpages;
>  	pgdat->node_present_pages = realtotalpages;
> -	printk(KERN_DEBUG "On node %d totalpages: %lu\n", pgdat->node_id,
> +	printk(KERN_DEBUG "On node %d realtotalpages: %lu\n", pgdat->node_id,
>  							realtotalpages);
>  }
>  
> -- 
> 1.8.3.1
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/3] include/linux/gfp.h: use unsigned int in gfp_zone
       [not found] ` <1525416729-108201-3-git-send-email-yehs1@lenovo.com>
@ 2018-05-04 13:35   ` Michal Hocko
  2018-05-04 15:40     ` Matthew Wilcox
  0 siblings, 1 reply; 20+ messages in thread
From: Michal Hocko @ 2018-05-04 13:35 UTC (permalink / raw)
  To: Huaisheng Ye
  Cc: akpm, linux-mm, vbabka, mgorman, pasha.tatashin, alexander.levin,
	hannes, penguin-kernel, colyli, chengnt, linux-kernel

On Fri 04-05-18 14:52:08, Huaisheng Ye wrote:
> Suggest using unsigned int instead of int for bit within gfp_zone.
> 
> Within function gfp_zone, the value of local variable bit comes from
> formal parameter flags, which's type is gfp_t. Local variable bit
> indicates the number of bits in the right shift for GFP_ZONE_TABLE
> with GFP_ZONES_SHIFT. So, variable bit shall always be unsigned
> integer, it doesn't make sense that forcing it to be a signed integer.
> 
> Current GFP_ZONEMASK is just valid as low four bits, the largest
> value of bit shall be less or equal 0x0F. But in the future, as the
> mask expands to higher bits, there will be a risk of confusion.

I am highly skeptical we will ever grow the number of zones enough
that signed vs. unsigned would matter. So I guess this all boils down to
aesthetic. I do not care either way. The generated code seems the be the
same.

> Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
> ---
>  include/linux/gfp.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 1a4582b..21551fc 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -401,7 +401,7 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
>  static inline enum zone_type gfp_zone(gfp_t flags)
>  {
>  	enum zone_type z;
> -	int bit = (__force int) (flags & GFP_ZONEMASK);
> +	unsigned int bit = (__force unsigned int) (flags & GFP_ZONEMASK);
>  
>  	z = (GFP_ZONE_TABLE >> (bit * GFP_ZONES_SHIFT)) &
>  					 ((1 << GFP_ZONES_SHIFT) - 1);
> -- 
> 1.8.3.1
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/3] include/linux/gfp.h: use unsigned int in gfp_zone
  2018-05-04 13:35   ` [PATCH 2/3] include/linux/gfp.h: use unsigned int in gfp_zone Michal Hocko
@ 2018-05-04 15:40     ` Matthew Wilcox
  2018-05-04 17:50       ` [External] " Huaisheng HS1 Ye
  2018-05-06  9:32       ` Huaisheng HS1 Ye
  0 siblings, 2 replies; 20+ messages in thread
From: Matthew Wilcox @ 2018-05-04 15:40 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Huaisheng Ye, akpm, linux-mm, vbabka, mgorman, pasha.tatashin,
	alexander.levin, hannes, penguin-kernel, colyli, chengnt,
	linux-kernel

On Fri, May 04, 2018 at 03:35:33PM +0200, Michal Hocko wrote:
> On Fri 04-05-18 14:52:08, Huaisheng Ye wrote:
> > Suggest using unsigned int instead of int for bit within gfp_zone.
> > @@ -401,7 +401,7 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
> >  static inline enum zone_type gfp_zone(gfp_t flags)
> >  {
> >  	enum zone_type z;
> > -	int bit = (__force int) (flags & GFP_ZONEMASK);
> > +	unsigned int bit = (__force unsigned int) (flags & GFP_ZONEMASK);
> >  
> >  	z = (GFP_ZONE_TABLE >> (bit * GFP_ZONES_SHIFT)) &
> >  					 ((1 << GFP_ZONES_SHIFT) - 1);

That reminds me.  I wanted to talk about getting rid of GFP_ZONE_TABLE.
Instead, we should encode the zone number in the bottom three bits of
the gfp mask, while preserving the rules that ZONE_NORMAL gets encoded
as zero (so GFP_KERNEL | GFP_HIGHMEM continues to work) and also leaving
__GFP_MOVABLE in bit 3 so that it can continue to be used as a flag.

So I was thinking ...

-#define ___GFP_DMA             0x01u
-#define ___GFP_HIGHMEM         0x02u
-#define ___GFP_DMA32           0x04u
+#define ___GFP_ZONE_MASK	0x07u

#define __GFP_DMA	((__force gfp_t)OPT_ZONE_DMA ^ ZONE_NORMAL)
#define __GFP_HIGHMEM	((__force gfp_t)OPT_ZONE_HIGHMEM ^ ZONE_NORMAL)
#define __GFP_DMA32	((__force gfp_t)OPT_ZONE_DMA32 ^ ZONE_NORMAL)
#define __GFP_MOVABLE	((__force gfp_t)ZONE_MOVABLE ^ ZONE_NORMAL | \
			 ___GFP_MOVABLE)
#define GFP_ZONEMASK	((__force gfp_t)___GFP_ZONE_MASK | ___GFP_MOVABLE)

Then we can delete GFP_ZONE_TABLE and GFP_ZONE_BAD.
gfp_zone simply becomes:

static inline enum zone_type gfp_zone(gfp_t flags)
{
	return ((__force int)flags & ___GFP_ZONE_MASK) ^ ZONE_NORMAL;
}

Huaisheng Ye, would you have time to investigate this idea?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [External]  Re: [PATCH 2/3] include/linux/gfp.h: use unsigned int in gfp_zone
  2018-05-04 15:40     ` Matthew Wilcox
@ 2018-05-04 17:50       ` Huaisheng HS1 Ye
  2018-05-06  9:32       ` Huaisheng HS1 Ye
  1 sibling, 0 replies; 20+ messages in thread
From: Huaisheng HS1 Ye @ 2018-05-04 17:50 UTC (permalink / raw)
  To: Matthew Wilcox, Michal Hocko
  Cc: akpm, linux-mm, vbabka, mgorman, pasha.tatashin, alexander.levin,
	hannes, penguin-kernel, colyli, NingTing Cheng, linux-kernel

> On Fri, May 04, 2018 at 03:35:33PM +0200, Michal Hocko wrote:
> > On Fri 04-05-18 14:52:08, Huaisheng Ye wrote:
> > > Suggest using unsigned int instead of int for bit within gfp_zone.
> > > @@ -401,7 +401,7 @@ static inline bool gfpflags_allow_blocking(const
> gfp_t gfp_flags)
> > >  static inline enum zone_type gfp_zone(gfp_t flags)
> > >  {
> > >  	enum zone_type z;
> > > -	int bit = (__force int) (flags & GFP_ZONEMASK);
> > > +	unsigned int bit = (__force unsigned int) (flags & GFP_ZONEMASK);
> > >
> > >  	z = (GFP_ZONE_TABLE >> (bit * GFP_ZONES_SHIFT)) &
> > >  					 ((1 << GFP_ZONES_SHIFT) - 1);
> 
> That reminds me.  I wanted to talk about getting rid of GFP_ZONE_TABLE.
> Instead, we should encode the zone number in the bottom three bits of
> the gfp mask, while preserving the rules that ZONE_NORMAL gets encoded
> as zero (so GFP_KERNEL | GFP_HIGHMEM continues to work) and also leaving
> __GFP_MOVABLE in bit 3 so that it can continue to be used as a flag.
> 
> So I was thinking ...
> 
> -#define ___GFP_DMA             0x01u
> -#define ___GFP_HIGHMEM         0x02u
> -#define ___GFP_DMA32           0x04u
> +#define ___GFP_ZONE_MASK	0x07u
> 
> #define __GFP_DMA	((__force gfp_t)OPT_ZONE_DMA ^ ZONE_NORMAL)
> #define __GFP_HIGHMEM	((__force gfp_t)OPT_ZONE_HIGHMEM ^
> ZONE_NORMAL)
> #define __GFP_DMA32	((__force gfp_t)OPT_ZONE_DMA32 ^
> ZONE_NORMAL)
> #define __GFP_MOVABLE	((__force gfp_t)ZONE_MOVABLE ^ ZONE_NORMAL | \
> 			 ___GFP_MOVABLE)
> #define GFP_ZONEMASK	((__force gfp_t)___GFP_ZONE_MASK |
> ___GFP_MOVABLE)
> 
> Then we can delete GFP_ZONE_TABLE and GFP_ZONE_BAD.
> gfp_zone simply becomes:
> 
> static inline enum zone_type gfp_zone(gfp_t flags)
> {
> 	return ((__force int)flags & ___GFP_ZONE_MASK) ^ ZONE_NORMAL;
> }
> 
> Huaisheng Ye, would you have time to investigate this idea?

OK, it is a great pleasure for me, let me think about how it works in detail.

Sincerely,
Huaisheng, Ye
OS Team | Lenovo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [External]  Re: [PATCH 3/3] mm/page_alloc: Fix typo in debug info of calculate_node_totalpages
  2018-05-04 13:18   ` [PATCH 3/3] mm/page_alloc: Fix typo in debug info of calculate_node_totalpages Michal Hocko
@ 2018-05-05  2:10     ` Huaisheng HS1 Ye
  2018-05-09  7:48       ` Michal Hocko
  0 siblings, 1 reply; 20+ messages in thread
From: Huaisheng HS1 Ye @ 2018-05-05  2:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, linux-mm, vbabka, mgorman, pasha.tatashin, alexander.levin,
	hannes, penguin-kernel, colyli, NingTing Cheng, linux-kernel


> On Fri 04-05-18 14:52:09, Huaisheng Ye wrote:
> > realtotalpages is calculated by taking off absent_pages from
> > spanned_pages in every zone.
> > Debug message of calculate_node_totalpages shall accurately
> > indicate that it is real totalpages to avoid ambiguity.
> 
> Is the printk actually useful? Why don't we simply remove it? You can
> get the information from /proc/zoneinfo so why to litter the dmesg
> output?

Indeed, we can get the amount of pfns as spanned, present and managed
from /proc/zoneinfo after memory initialization has been finished.

But this printk is a relatively meaningful reference within dmesg log.
Especially for people who doesn't have much experience, or someone
has a plan to modify boundary of zones within free_area_init_*.

Sincerely,
Huaisheng Ye
Linux kernel | Lenovo
> 
> > Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
> > ---
> >  mm/page_alloc.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 1b39db4..9d57db2 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -5967,7 +5967,7 @@ static void __meminit
> calculate_node_totalpages(struct pglist_data *pgdat,
> >
> >  	pgdat->node_spanned_pages = totalpages;
> >  	pgdat->node_present_pages = realtotalpages;
> > -	printk(KERN_DEBUG "On node %d totalpages: %lu\n", pgdat->node_id,
> > +	printk(KERN_DEBUG "On node %d realtotalpages: %lu\n",
> pgdat->node_id,
> >  							realtotalpages);
> >  }
> >
> > --
> > 1.8.3.1

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [External]  Re: [PATCH 2/3] include/linux/gfp.h: use unsigned int in gfp_zone
  2018-05-04 15:40     ` Matthew Wilcox
  2018-05-04 17:50       ` [External] " Huaisheng HS1 Ye
@ 2018-05-06  9:32       ` Huaisheng HS1 Ye
  2018-05-06 13:48         ` Matthew Wilcox
  1 sibling, 1 reply; 20+ messages in thread
From: Huaisheng HS1 Ye @ 2018-05-06  9:32 UTC (permalink / raw)
  To: Matthew Wilcox, Michal Hocko
  Cc: akpm, linux-mm, vbabka, mgorman, pasha.tatashin, alexander.levin,
	hannes, penguin-kernel, colyli, NingTing Cheng, linux-kernel


> On Fri, May 04, 2018 at 03:35:33PM +0200, Michal Hocko wrote:
> > On Fri 04-05-18 14:52:08, Huaisheng Ye wrote:
> > > Suggest using unsigned int instead of int for bit within gfp_zone.
> > > @@ -401,7 +401,7 @@ static inline bool gfpflags_allow_blocking(const
> gfp_t gfp_flags)
> > >  static inline enum zone_type gfp_zone(gfp_t flags)
> > >  {
> > >  	enum zone_type z;
> > > -	int bit = (__force int) (flags & GFP_ZONEMASK);
> > > +	unsigned int bit = (__force unsigned int) (flags & GFP_ZONEMASK);
> > >
> > >  	z = (GFP_ZONE_TABLE >> (bit * GFP_ZONES_SHIFT)) &
> > >  					 ((1 << GFP_ZONES_SHIFT) - 1);
> 
> That reminds me.  I wanted to talk about getting rid of GFP_ZONE_TABLE.
> Instead, we should encode the zone number in the bottom three bits of
> the gfp mask, while preserving the rules that ZONE_NORMAL gets encoded
> as zero (so GFP_KERNEL | GFP_HIGHMEM continues to work) and also leaving
> __GFP_MOVABLE in bit 3 so that it can continue to be used as a flag.
> 
> So I was thinking ...
> 
> -#define ___GFP_DMA             0x01u
> -#define ___GFP_HIGHMEM         0x02u
> -#define ___GFP_DMA32           0x04u
> +#define ___GFP_ZONE_MASK	0x07u
> 
> #define __GFP_DMA	((__force gfp_t)OPT_ZONE_DMA ^ ZONE_NORMAL)
> #define __GFP_HIGHMEM	((__force gfp_t)OPT_ZONE_HIGHMEM ^
> ZONE_NORMAL)
> #define __GFP_DMA32	((__force gfp_t)OPT_ZONE_DMA32 ^
> ZONE_NORMAL)
> #define __GFP_MOVABLE	((__force gfp_t)ZONE_MOVABLE ^ ZONE_NORMAL | \
> 			 ___GFP_MOVABLE)
> #define GFP_ZONEMASK	((__force gfp_t)___GFP_ZONE_MASK |
> ___GFP_MOVABLE)
> 
> Then we can delete GFP_ZONE_TABLE and GFP_ZONE_BAD.
> gfp_zone simply becomes:
> 
> static inline enum zone_type gfp_zone(gfp_t flags)
> {
> 	return ((__force int)flags & ___GFP_ZONE_MASK) ^ ZONE_NORMAL;
> }
> 
> Huaisheng Ye, would you have time to investigate this idea?

Dear Matthew and Michal,

This idea is great, we can replace GFP_ZONE_TABLE and GFP_ZONE_BAD with it.
I have realized it preliminarily based on your code and tested it on a 2 sockets platform. Fortunately, we got a positive test result.

I made some adjustments for __GFP_HIGHMEM, this flag is special than others, because the return result of gfp_zone has two possibilities, which depend on ___GFP_MOVABLE has been enabled or disabled.
When ___GFP_MOVABLE has been enabled, ZONE_MOVABLE shall be returned. When disabled, OPT_ZONE_HIGHMEM shall be used.

#define __GFP_DMA	((__force gfp_t)OPT_ZONE_DMA ^ ZONE_NORMAL)
#define __GFP_HIGHMEM	((__force gfp_t)ZONE_MOVABLE ^ ZONE_NORMAL)
#define __GFP_DMA32	((__force gfp_t)OPT_ZONE_DMA32 ^ ZONE_NORMAL)
#define __GFP_MOVABLE	((__force gfp_t)___GFP_MOVABLE)  /* ZONE_MOVABLE allowed */
#define GFP_ZONEMASK	((__force gfp_t)___GFP_ZONE_MASK | ___GFP_MOVABLE)

The present situation is that, based on this change, the bits of flags, __GFP_DMA and __GFP_HIGHMEM and __GFP_DMA32, have been encoded.
That is totally different from existing code, you know in kernel scope, there are many drivers or subsystems use these flags directly to realize bit manipulations like this below,
swiotlb-xen.c (drivers\xen):	flags &= ~(__GFP_DMA | __GFP_HIGHMEM);
extent_io.c (fs\btrfs):			mask &= ~(__GFP_DMA32|__GFP_HIGHMEM);

Because of these flags have been encoded, the above operations can cause problem.
I am trying to get a solution to resolve it. Any progress will be reported.

Sincerely,
Huaisheng Ye
Linux kernel | Lenovo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [External]  Re: [PATCH 2/3] include/linux/gfp.h: use unsigned int in gfp_zone
  2018-05-06  9:32       ` Huaisheng HS1 Ye
@ 2018-05-06 13:48         ` Matthew Wilcox
  2018-05-06 16:17           ` Huaisheng HS1 Ye
  0 siblings, 1 reply; 20+ messages in thread
From: Matthew Wilcox @ 2018-05-06 13:48 UTC (permalink / raw)
  To: Huaisheng HS1 Ye
  Cc: Michal Hocko, akpm, linux-mm, vbabka, mgorman, pasha.tatashin,
	alexander.levin, hannes, penguin-kernel, colyli, NingTing Cheng,
	linux-kernel

On Sun, May 06, 2018 at 09:32:15AM +0000, Huaisheng HS1 Ye wrote:
> This idea is great, we can replace GFP_ZONE_TABLE and GFP_ZONE_BAD with it.
> I have realized it preliminarily based on your code and tested it on a 2 sockets platform. Fortunately, we got a positive test result.

Great!

> I made some adjustments for __GFP_HIGHMEM, this flag is special than others, because the return result of gfp_zone has two possibilities, which depend on ___GFP_MOVABLE has been enabled or disabled.
> When ___GFP_MOVABLE has been enabled, ZONE_MOVABLE shall be returned. When disabled, OPT_ZONE_HIGHMEM shall be used.
> 
> #define __GFP_DMA	((__force gfp_t)OPT_ZONE_DMA ^ ZONE_NORMAL)
> #define __GFP_HIGHMEM	((__force gfp_t)ZONE_MOVABLE ^ ZONE_NORMAL)

I'm not sure this is right ... Let me think about this a little.

> #define __GFP_DMA32	((__force gfp_t)OPT_ZONE_DMA32 ^ ZONE_NORMAL)
> #define __GFP_MOVABLE	((__force gfp_t)___GFP_MOVABLE)  /* ZONE_MOVABLE allowed */
> #define GFP_ZONEMASK	((__force gfp_t)___GFP_ZONE_MASK | ___GFP_MOVABLE)
> 
> The present situation is that, based on this change, the bits of flags, __GFP_DMA and __GFP_HIGHMEM and __GFP_DMA32, have been encoded.
> That is totally different from existing code, you know in kernel scope, there are many drivers or subsystems use these flags directly to realize bit manipulations like this below,
> swiotlb-xen.c (drivers\xen):	flags &= ~(__GFP_DMA | __GFP_HIGHMEM);
> extent_io.c (fs\btrfs):			mask &= ~(__GFP_DMA32|__GFP_HIGHMEM);
> 
> Because of these flags have been encoded, the above operations can cause problem.
> I am trying to get a solution to resolve it. Any progress will be reported.

These users probably want:

flags &= GFP_RECLAIM_MASK;

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [External]  Re: [PATCH 2/3] include/linux/gfp.h: use unsigned int in gfp_zone
  2018-05-06 13:48         ` Matthew Wilcox
@ 2018-05-06 16:17           ` Huaisheng HS1 Ye
  2018-05-06 18:55             ` Matthew Wilcox
  0 siblings, 1 reply; 20+ messages in thread
From: Huaisheng HS1 Ye @ 2018-05-06 16:17 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Michal Hocko, akpm, linux-mm, vbabka, mgorman, pasha.tatashin,
	alexander.levin, hannes, penguin-kernel, colyli, NingTing Cheng,
	linux-kernel


> -----Original Message-----
> From: owner-linux-mm@kvack.org [mailto:owner-linux-mm@kvack.org] On
> Behalf Of Matthew Wilcox
> Sent: Sunday, May 06, 2018 9:48 PM
> To: Huaisheng HS1 Ye <yehs1@lenovo.com>
> Cc: Michal Hocko <mhocko@kernel.org>; akpm@linux-foundation.org;
> linux-mm@kvack.org; vbabka@suse.cz; mgorman@techsingularity.net;
> pasha.tatashin@oracle.com; alexander.levin@verizon.com;
> hannes@cmpxchg.org; penguin-kernel@I-love.SAKURA.ne.jp; colyli@suse.de;
> NingTing Cheng <chengnt@lenovo.com>; linux-kernel@vger.kernel.org
> Subject: Re: [External] Re: [PATCH 2/3] include/linux/gfp.h: use unsigned int in
> gfp_zone
> 
> On Sun, May 06, 2018 at 09:32:15AM +0000, Huaisheng HS1 Ye wrote:
> > This idea is great, we can replace GFP_ZONE_TABLE and GFP_ZONE_BAD with
> it.
> > I have realized it preliminarily based on your code and tested it on a 2 sockets
> platform. Fortunately, we got a positive test result.
> 
> Great!
> 
> > I made some adjustments for __GFP_HIGHMEM, this flag is special than
> others, because the return result of gfp_zone has two possibilities, which
> depend on ___GFP_MOVABLE has been enabled or disabled.
> > When ___GFP_MOVABLE has been enabled, ZONE_MOVABLE shall be
> returned. When disabled, OPT_ZONE_HIGHMEM shall be used.
> >
> > #define __GFP_DMA	((__force gfp_t)OPT_ZONE_DMA ^ ZONE_NORMAL)
> > #define __GFP_HIGHMEM	((__force gfp_t)ZONE_MOVABLE ^
> ZONE_NORMAL)
> 
> I'm not sure this is right ... Let me think about this a little.

Upload my current patch and testing platform info for reference. This patch has been tested 
on a two sockets platform.
Here is dmesg log about zones,

 397 [    0.000000] Zone ranges:
 398 [    0.000000]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
 399 [    0.000000]   DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
 400 [    0.000000]   Normal   [mem 0x0000000100000000-0x000000277fffffff]
 401 [    0.000000]   Device   empty
 402 [    0.000000] Movable zone start for each node
 403 [    0.000000] Early memory node ranges
 404 [    0.000000]   node   0: [mem 0x0000000000001000-0x000000000009ffff]
 405 [    0.000000]   node   0: [mem 0x0000000000100000-0x00000000a69c2fff]
 406 [    0.000000]   node   0: [mem 0x00000000a7654000-0x00000000a85eefff]
 407 [    0.000000]   node   0: [mem 0x00000000ab399000-0x00000000af3f6fff]
 408 [    0.000000]   node   0: [mem 0x00000000af429000-0x00000000af7fffff]
 409 [    0.000000]   node   0: [mem 0x0000000100000000-0x000000043fffffff]
 410 [    0.000000]   node   1: [mem 0x0000002380000000-0x000000277fffffff]

 416 [    0.000000] Initmem setup node 0 [mem 0x0000000000001000-0x000000043fffffff]
 417 [    0.000000] On node 0 totalpages: 4111666
 418 [    0.000000]   DMA zone: 64 pages used for memmap
 419 [    0.000000]   DMA zone: 23 pages reserved
 420 [    0.000000]   DMA zone: 3999 pages, LIFO batch:0
 421 [    0.000000] mminit::memmap_init Initialising map node 0 zone 0 pfns 1 -> 4096
 422 [    0.000000]   DMA32 zone: 10935 pages used for memmap
 423 [    0.000000]   DMA32 zone: 699795 pages, LIFO batch:31
 424 [    0.000000] mminit::memmap_init Initialising map node 0 zone 1 pfns 4096 -> 1048576
 425 [    0.000000]   Normal zone: 53248 pages used for memmap
 426 [    0.000000]   Normal zone: 3407872 pages, LIFO batch:31
 427 [    0.000000] mminit::memmap_init Initialising map node 0 zone 2 pfns 1048576 -> 4456448
 428 [    0.000000] Initmem setup node 1 [mem 0x0000002380000000-0x000000277fffffff]
 429 [    0.000000] On node 1 totalpages: 4194304
 430 [    0.000000]   Normal zone: 65536 pages used for memmap
 431 [    0.000000]   Normal zone: 4194304 pages, LIFO batch:31
 432 [    0.000000] mminit::memmap_init Initialising map node 1 zone 2 pfns 37224448 -> 41418752

 986 [    0.000000] mminit::zonelist general 0:DMA = 0:DMA
 987 [    0.000000] mminit::zonelist general 0:DMA32 = 0:DMA32 0:DMA
 988 [    0.000000] mminit::zonelist general 0:Normal = 0:Normal 0:DMA32 0:DMA 1:Normal
 989 [    0.000000] mminit::zonelist thisnode 0:DMA = 0:DMA
 990 [    0.000000] mminit::zonelist thisnode 0:DMA32 = 0:DMA32 0:DMA
 991 [    0.000000] mminit::zonelist thisnode 0:Normal = 0:Normal 0:DMA32 0:DMA
 992 [    0.000000] mminit::zonelist general 1:Normal = 1:Normal 0:Normal 0:DMA32 0:DMA
 993 [    0.000000] mminit::zonelist thisnode 1:Normal = 1:Normal
 994 [    0.000000] Built 2 zonelists, mobility grouping on.  Total pages: 8176164

Here is some information of ZONE_NORMAL which comes from /proc/zoneinfo
1131 Node 0, zone   Normal
1132   pages free     3171428
1133         min      9249
1134         low      12584
1135         high     15919
1136         spanned  3407872
1137         present  3407872
1138         managed  3335769
1139         protection: (0, 0, 0, 0, 0)
1140       nr_free_pages 3171428
1141       nr_zone_inactive_anon 12
1142       nr_zone_active_anon 13585
1143       nr_zone_inactive_file 37028
1144       nr_zone_active_file 12104
1145       nr_zone_unevictable 0
1146       nr_zone_write_pending 7
1147       nr_mlock     0
1148       nr_page_table_pages 1026
1149       nr_kernel_stack 10920
1150       nr_bounce    0
1151       nr_zspages   0
1152       nr_free_cma  0
1153       numa_hit     792300
1154       numa_miss    0
1155       numa_foreign 0
1156       numa_interleave 26268
1157       numa_local   768300
1158       numa_other   24000

1718 Node 1, zone   Normal
1747   pages free     3856001
1748         min      11405
1749         low      15518
1750         high     19631
1751         spanned  4194304
1752         present  4194304
1753         managed  4114482
1754         protection: (0, 0, 0, 0, 0)
1755       nr_free_pages 3856001
1756       nr_zone_inactive_anon 424
1757       nr_zone_active_anon 10679
1758       nr_zone_inactive_file 35274
1759       nr_zone_active_file 22189
1760       nr_zone_unevictable 0
1761       nr_zone_write_pending 0
1762       nr_mlock     0
1763       nr_page_table_pages 800
1764       nr_kernel_stack 9848
1765       nr_bounce    0
1766       nr_zspages   0
1767       nr_free_cma  0
1768       numa_hit     757099
1769       numa_miss    0
1770       numa_foreign 0
1771       numa_interleave 26314
1772       numa_local   712341
1773       numa_other   44758

Subject: [RFC PATCH v0.1] include/linux/gfp.h: Replace GFP_ZONE_TABLE with bit
 encoding

It works, but some drivers or subsystem shall be modified to fit
these new type __GFP flags.
They use these flags directly to realize bit manipulations like this
below.

eg.
swiotlb-xen.c (drivers\xen):    flags &= ~(__GFP_DMA | __GFP_HIGHMEM);
extent_io.c (fs\btrfs):         mask &= ~(__GFP_DMA32|__GFP_HIGHMEM);

Because of these flags have been encoded within this patch, the
above operations can cause problem.

Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
---
 include/linux/gfp.h | 49 ++++++++++---------------------------------------
 1 file changed, 10 insertions(+), 39 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 1a4582b..1647385 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -16,9 +16,7 @@
  */

 /* Plain integer GFP bitmasks. Do not use this directly. */
-#define ___GFP_DMA             0x01u
-#define ___GFP_HIGHMEM         0x02u
-#define ___GFP_DMA32           0x04u
+#define ___GFP_ZONE_MASK       0x07u
 #define ___GFP_MOVABLE         0x08u
 #define ___GFP_RECLAIMABLE     0x10u
 #define ___GFP_HIGH            0x20u
@@ -53,11 +51,11 @@
* without the underscores and use them consistently. The definitions here may
  * be used in bit comparisons.
  */
-#define __GFP_DMA      ((__force gfp_t)___GFP_DMA)
-#define __GFP_HIGHMEM  ((__force gfp_t)___GFP_HIGHMEM)
-#define __GFP_DMA32    ((__force gfp_t)___GFP_DMA32)
+#define __GFP_DMA      ((__force gfp_t)OPT_ZONE_DMA ^ ZONE_NORMAL)
+#define __GFP_HIGHMEM  ((__force gfp_t)ZONE_MOVABLE ^ ZONE_NORMAL)
+#define __GFP_DMA32    ((__force gfp_t)OPT_ZONE_DMA32 ^ ZONE_NORMAL)
 #define __GFP_MOVABLE  ((__force gfp_t)___GFP_MOVABLE)  /* ZONE_MOVABLE allowed */
-#define GFP_ZONEMASK   (__GFP_DMA|__GFP_HIGHMEM|__GFP_DMA32|__GFP_MOVABLE)
+#define GFP_ZONEMASK   ((__force gfp_t)___GFP_ZONE_MASK | ___GFP_MOVABLE)

 /*
  * Page mobility and placement hints
@@ -370,42 +368,15 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
 #error GFP_ZONES_SHIFT too large to create GFP_ZONE_TABLE integer
 #endif

-#define GFP_ZONE_TABLE ( \
-       (ZONE_NORMAL << 0 * GFP_ZONES_SHIFT)                                   \
-       | (OPT_ZONE_DMA << ___GFP_DMA * GFP_ZONES_SHIFT)                       \
-       | (OPT_ZONE_HIGHMEM << ___GFP_HIGHMEM * GFP_ZONES_SHIFT)               \
-       | (OPT_ZONE_DMA32 << ___GFP_DMA32 * GFP_ZONES_SHIFT)                   \
-       | (ZONE_NORMAL << ___GFP_MOVABLE * GFP_ZONES_SHIFT)                    \
-       | (OPT_ZONE_DMA << (___GFP_MOVABLE | ___GFP_DMA) * GFP_ZONES_SHIFT)    \
-       | (ZONE_MOVABLE << (___GFP_MOVABLE | ___GFP_HIGHMEM) * GFP_ZONES_SHIFT)\
-       | (OPT_ZONE_DMA32 << (___GFP_MOVABLE | ___GFP_DMA32) * GFP_ZONES_SHIFT)\
-)
-
-/*
- * GFP_ZONE_BAD is a bitmap for all combinations of __GFP_DMA, __GFP_DMA32
- * __GFP_HIGHMEM and __GFP_MOVABLE that are not permitted. One flag per
- * entry starting with bit 0. Bit is set if the combination is not
- * allowed.
- */
-#define GFP_ZONE_BAD ( \
-       1 << (___GFP_DMA | ___GFP_HIGHMEM)                                    \
-       | 1 << (___GFP_DMA | ___GFP_DMA32)                                    \
-       | 1 << (___GFP_DMA32 | ___GFP_HIGHMEM)                                \
-       | 1 << (___GFP_DMA | ___GFP_DMA32 | ___GFP_HIGHMEM)                   \
-       | 1 << (___GFP_MOVABLE | ___GFP_HIGHMEM | ___GFP_DMA)                 \
-       | 1 << (___GFP_MOVABLE | ___GFP_DMA32 | ___GFP_DMA)                   \
-       | 1 << (___GFP_MOVABLE | ___GFP_DMA32 | ___GFP_HIGHMEM)               \
-       | 1 << (___GFP_MOVABLE | ___GFP_DMA32 | ___GFP_DMA | ___GFP_HIGHMEM)  \
-)
-
 static inline enum zone_type gfp_zone(gfp_t flags)
{
        enum zone_type z;
-       int bit = (__force int) (flags & GFP_ZONEMASK);
+       z = ((__force unsigned int)flags & ___GFP_ZONE_MASK) ^ ZONE_NORMAL;

-       z = (GFP_ZONE_TABLE >> (bit * GFP_ZONES_SHIFT)) &
-                                        ((1 << GFP_ZONES_SHIFT) - 1);
-       VM_BUG_ON((GFP_ZONE_BAD >> bit) & 1);
+       if (z > OPT_ZONE_HIGHMEM) {
+               z = OPT_ZONE_HIGHMEM +
+                       !!((__force unsigned int)flags & ___GFP_MOVABLE);
+       }
        return z;
 }

-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [External]  Re: [PATCH 2/3] include/linux/gfp.h: use unsigned int in gfp_zone
  2018-05-06 16:17           ` Huaisheng HS1 Ye
@ 2018-05-06 18:55             ` Matthew Wilcox
  2018-05-07 17:16               ` Huaisheng HS1 Ye
  0 siblings, 1 reply; 20+ messages in thread
From: Matthew Wilcox @ 2018-05-06 18:55 UTC (permalink / raw)
  To: Huaisheng HS1 Ye
  Cc: Michal Hocko, akpm, linux-mm, vbabka, mgorman, pasha.tatashin,
	alexander.levin, hannes, penguin-kernel, colyli, NingTing Cheng,
	linux-kernel

On Sun, May 06, 2018 at 04:17:06PM +0000, Huaisheng HS1 Ye wrote:
> Upload my current patch and testing platform info for reference. This patch has been tested 
> on a two sockets platform.

Thank you!

> It works, but some drivers or subsystem shall be modified to fit
> these new type __GFP flags.
> They use these flags directly to realize bit manipulations like this
> below.
> 
> eg.
> swiotlb-xen.c (drivers\xen):    flags &= ~(__GFP_DMA | __GFP_HIGHMEM);
> extent_io.c (fs\btrfs):         mask &= ~(__GFP_DMA32|__GFP_HIGHMEM);
> 
> Because of these flags have been encoded within this patch, the
> above operations can cause problem.

I don't think this actually causes problems.  At least, no additional
problems.  These users will successfully clear __GFP_DMA and __GFP_HIGHMEM
no matter what values GFP_DMA and GFP_HIGHMEM have; the only problem will
be if someone calls them with a zone type they're not expecting (eg DMA32
for the first one or DMA for the second; or MOVABLE for either of them).
The thing is, they're already buggy in those circumstances.

>   */
> -#define __GFP_DMA      ((__force gfp_t)___GFP_DMA)
> -#define __GFP_HIGHMEM  ((__force gfp_t)___GFP_HIGHMEM)
> -#define __GFP_DMA32    ((__force gfp_t)___GFP_DMA32)
> +#define __GFP_DMA      ((__force gfp_t)OPT_ZONE_DMA ^ ZONE_NORMAL)
> +#define __GFP_HIGHMEM  ((__force gfp_t)ZONE_MOVABLE ^ ZONE_NORMAL)
> +#define __GFP_DMA32    ((__force gfp_t)OPT_ZONE_DMA32 ^ ZONE_NORMAL)
>  #define __GFP_MOVABLE  ((__force gfp_t)___GFP_MOVABLE)  /* ZONE_MOVABLE allowed */
[...]
>  static inline enum zone_type gfp_zone(gfp_t flags)
> {
>         enum zone_type z;
> -       int bit = (__force int) (flags & GFP_ZONEMASK);
> +       z = ((__force unsigned int)flags & ___GFP_ZONE_MASK) ^ ZONE_NORMAL;
> 
> -       z = (GFP_ZONE_TABLE >> (bit * GFP_ZONES_SHIFT)) &
> -                                        ((1 << GFP_ZONES_SHIFT) - 1);
> -       VM_BUG_ON((GFP_ZONE_BAD >> bit) & 1);
> +       if (z > OPT_ZONE_HIGHMEM) {
> +               z = OPT_ZONE_HIGHMEM +
> +                       !!((__force unsigned int)flags & ___GFP_MOVABLE);
> +       }
>         return z;
>  }

How about:

+#define __GFP_HIGHMEM  ((__force gfp_t)OPT_ZONE_HIGHMEM ^ ZONE_NORMAL)
-#define __GFP_MOVABLE  ((__force gfp_t)___GFP_MOVABLE)  /* ZONE_MOVABLE allowed */
+#define __GFP_MOVABLE  ((__force gfp_t)ZONE_MOVABLE ^ ZONE_NORMAL | \
+					___GFP_MOVABLE)

Then I think you can just make it:

static inline enum zone_type gfp_zone(gfp_t flags)
{
	return ((__force int)flags & ___GFP_ZONE_MASK) ^ ZONE_NORMAL;
}

> @@ -370,42 +368,15 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
>  #error GFP_ZONES_SHIFT too large to create GFP_ZONE_TABLE integer
>  #endif

You should be able to delete GFP_ZONES_SHIFT too.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [External]  Re: [PATCH 2/3] include/linux/gfp.h: use unsigned int in gfp_zone
  2018-05-06 18:55             ` Matthew Wilcox
@ 2018-05-07 17:16               ` Huaisheng HS1 Ye
  2018-05-07 18:44                 ` Matthew Wilcox
  0 siblings, 1 reply; 20+ messages in thread
From: Huaisheng HS1 Ye @ 2018-05-07 17:16 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Michal Hocko, akpm, linux-mm, vbabka, mgorman, pasha.tatashin,
	alexander.levin, hannes, penguin-kernel, colyli, NingTing Cheng,
	linux-kernel

Dear Matthew,

I will try to explain them in depth. Correct me if anything wrong.
> 
> On Sun, May 06, 2018 at 04:17:06PM +0000, Huaisheng HS1 Ye wrote:
> > Upload my current patch and testing platform info for reference. This patch
> has been tested
> > on a two sockets platform.
> 
> Thank you!
My pleasure.

> > It works, but some drivers or subsystem shall be modified to fit
> > these new type __GFP flags.
> > They use these flags directly to realize bit manipulations like this
> > below.
> >
> > eg.
> > swiotlb-xen.c (drivers\xen):    flags &= ~(__GFP_DMA | __GFP_HIGHMEM);
> > extent_io.c (fs\btrfs):         mask &= ~(__GFP_DMA32|__GFP_HIGHMEM);
> >
> > Because of these flags have been encoded within this patch, the
> > above operations can cause problem.
> 
> I don't think this actually causes problems.  At least, no additional
> problems.  These users will successfully clear __GFP_DMA and
> __GFP_HIGHMEM
> no matter what values GFP_DMA and GFP_HIGHMEM have; the only problem
> will be if someone calls them with a zone type they're not expecting (eg DMA32
> for the first one or DMA for the second; or MOVABLE for either of them).
> The thing is, they're already buggy in those circumstances.

I hope it couldn't cause problem, but based on my analyzation it has the potential to go wrong if users still use the flags as usual, which are __GFP_DMA, __GFP_DMA32 and __GFP_HIGHMEM.
Let me take an example with my testing platform, these logics are much abstract, an example will be helpful.

There is a two sockets X86_64 server, No HIGHMEM and it has 16 + 16GB memories.
Its zone types shall be like this below,

ZONE_DMA				0		0b0000
ZONE_DMA32				1		0b0001
ZONE_NORMAL			2		0b0010
(OPT_ZONE_HIGHMEM)	2		0b0010
ZONE_MOVABLE			3		0b0011
ZONE_DEVICE				4		0b0100 (virtual zone)
__MAX_NR_ZONES		5

__GFP_DMA		= ZONE_DMA    			^ ZONE_NORMAL= 0b0010
__GFP_DMA32		= ZONE_DMA32  		^ ZONE_NORMAL= 0b0011
__GFP_HIGHMEM = OPT_ZONE_HIGHMEM ^ ZONE_NORMAL = 0b0000
__GFP_MOVABLE	= ZONE_MOVABLE ^ ZONE_NORMAL | ___GFP_MOVABLE = 0b1001

Eg.
If a driver uses flags like this below,
Step 1:
gfp_mask  |  __GFP_DMA32;	
(0b 0000		|	0b 0011	= 0b 0011)
gfp_mask's low four bits shall equal to 0011, assuming no __GFP_MOVABLE

Step 2:
gfp_mask  & ~__GFP_DMA;	
(0b 0011	 & ~0b0010   = 0b0001)
gfp_mask's low four bits shall equal to 0001 now, then when it enter gfp_zone(),

return ((__force int)flags & ___GFP_ZONE_MASK) ^ ZONE_NORMAL;
(0b0001 ^ 0b0010 = 0b0011)
You know 0011 means that ZONE_MOVABLE will be returned.
In this case, error can be found, because gfp_mask needs to get ZONE_DMA32 originally.
But with existing GFP_ZONE_TABLE/BAD, it is correct. Because the bits are way of 0x1, 0x2, 0x4, 0x8

I just want to show a case of failure, please don't blame me that use case was invented.
Again, your idea is great in my eyes, which has much advantages than ZONE_TABLE/BAD.
But if we use this idea, that means other subsystem or driver shall not use the flags as existing way.
Of course, this limitation only exists in low 3 bits of gfp_t. The remaining high bits can be used as usual.

This is my opinion, maybe it is not accurate, but I really worry about it.

> >   */
> > -#define __GFP_DMA      ((__force gfp_t)___GFP_DMA)
> > -#define __GFP_HIGHMEM  ((__force gfp_t)___GFP_HIGHMEM)
> > -#define __GFP_DMA32    ((__force gfp_t)___GFP_DMA32)
> > +#define __GFP_DMA      ((__force gfp_t)OPT_ZONE_DMA ^
> ZONE_NORMAL)
> > +#define __GFP_HIGHMEM  ((__force gfp_t)ZONE_MOVABLE ^
> ZONE_NORMAL)
> > +#define __GFP_DMA32    ((__force gfp_t)OPT_ZONE_DMA32 ^
> ZONE_NORMAL)
> >  #define __GFP_MOVABLE  ((__force gfp_t)___GFP_MOVABLE)  /*
> ZONE_MOVABLE allowed */
> [...]
> >  static inline enum zone_type gfp_zone(gfp_t flags)
> > {
> >         enum zone_type z;
> > -       int bit = (__force int) (flags & GFP_ZONEMASK);
> > +       z = ((__force unsigned int)flags & ___GFP_ZONE_MASK) ^
> ZONE_NORMAL;
> >
> > -       z = (GFP_ZONE_TABLE >> (bit * GFP_ZONES_SHIFT)) &
> > -                                        ((1 << GFP_ZONES_SHIFT) - 1);
> > -       VM_BUG_ON((GFP_ZONE_BAD >> bit) & 1);
> > +       if (z > OPT_ZONE_HIGHMEM) {
> > +               z = OPT_ZONE_HIGHMEM +
> > +                       !!((__force unsigned int)flags & ___GFP_MOVABLE);
> > +       }
> >         return z;
> >  }
> 
> How about:
> 
> +#define __GFP_HIGHMEM  ((__force gfp_t)OPT_ZONE_HIGHMEM ^
> ZONE_NORMAL)
> -#define __GFP_MOVABLE  ((__force gfp_t)___GFP_MOVABLE)  /*
> ZONE_MOVABLE allowed */
> +#define __GFP_MOVABLE  ((__force gfp_t)ZONE_MOVABLE ^
> ZONE_NORMAL | \
> +					___GFP_MOVABLE)
> 
> Then I think you can just make it:
> 
> static inline enum zone_type gfp_zone(gfp_t flags)
> {
> 	return ((__force int)flags & ___GFP_ZONE_MASK) ^ ZONE_NORMAL;
> }
Sorry, I think it has risk in this way, let me introduce a failure case for example.

Now suppose that, there is a flag should represent DMA flag with movable.
It should be like this below,
__GFP_DMA | __GFP_MOVABLE
(0b 0010       |   0b 1001   = 0b 1011)
Normally, gfp_zone shall return ZONE_DMA but with MOVABLE policy, right?
But with your code, gfp_zone will return ZONE_DMA32 with MOVABLE policy.
(0b 1011  ^  0b 0010 = 1001)

You can find that something wrong happens, so that is why I make gfp_zone more complicated than yours.

> > @@ -370,42 +368,15 @@ static inline bool gfpflags_allow_blocking(const
> gfp_t gfp_flags)
> >  #error GFP_ZONES_SHIFT too large to create GFP_ZONE_TABLE integer
> >  #endif
> 
> You should be able to delete GFP_ZONES_SHIFT too.
Yes, you are right.

Sincerely,
Huaisheng Ye | 叶怀胜
Linux kernel | Lenovo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [External]  Re: [PATCH 2/3] include/linux/gfp.h: use unsigned int in gfp_zone
  2018-05-07 17:16               ` Huaisheng HS1 Ye
@ 2018-05-07 18:44                 ` Matthew Wilcox
  2018-05-07 21:25                   ` David Sterba
  2018-05-08  0:25                   ` Huaisheng HS1 Ye
  0 siblings, 2 replies; 20+ messages in thread
From: Matthew Wilcox @ 2018-05-07 18:44 UTC (permalink / raw)
  To: Huaisheng HS1 Ye
  Cc: Michal Hocko, akpm, linux-mm, vbabka, mgorman, pasha.tatashin,
	alexander.levin, hannes, penguin-kernel, colyli, NingTing Cheng,
	linux-kernel

On Mon, May 07, 2018 at 05:16:50PM +0000, Huaisheng HS1 Ye wrote:
> I hope it couldn't cause problem, but based on my analyzation it has the potential to go wrong if users still use the flags as usual, which are __GFP_DMA, __GFP_DMA32 and __GFP_HIGHMEM.
> Let me take an example with my testing platform, these logics are much abstract, an example will be helpful.
> 
> There is a two sockets X86_64 server, No HIGHMEM and it has 16 + 16GB memories.
> Its zone types shall be like this below,
> 
> ZONE_DMA		0		0b0000
> ZONE_DMA32		1		0b0001
> ZONE_NORMAL		2		0b0010
> (OPT_ZONE_HIGHMEM)	2		0b0010
> ZONE_MOVABLE		3		0b0011
> ZONE_DEVICE		4		0b0100 (virtual zone)
> __MAX_NR_ZONES	5
> 
> __GFP_DMA	= ZONE_DMA ^ ZONE_NORMAL= 0b0010
> __GFP_DMA32	= ZONE_DMA32 ^ ZONE_NORMAL= 0b0011
> __GFP_HIGHMEM = OPT_ZONE_HIGHMEM ^ ZONE_NORMAL = 0b0000
> __GFP_MOVABLE	= ZONE_MOVABLE ^ ZONE_NORMAL | ___GFP_MOVABLE = 0b1001
> 
> Eg.
> If a driver uses flags like this below,
> Step 1:
> gfp_mask  |  __GFP_DMA32;	
> (0b 0000		|	0b 0011	= 0b 0011)
> gfp_mask's low four bits shall equal to 0011, assuming no __GFP_MOVABLE
> 
> Step 2:
> gfp_mask  & ~__GFP_DMA;	
> (0b 0011	 & ~0b0010   = 0b0001)
> gfp_mask's low four bits shall equal to 0001 now, then when it enter gfp_zone(),
> 
> return ((__force int)flags & ___GFP_ZONE_MASK) ^ ZONE_NORMAL;
> (0b0001 ^ 0b0010 = 0b0011)
> You know 0011 means that ZONE_MOVABLE will be returned.
> In this case, error can be found, because gfp_mask needs to get ZONE_DMA32 originally.
> But with existing GFP_ZONE_TABLE/BAD, it is correct. Because the bits are way of 0x1, 0x2, 0x4, 0x8

Yes, I understand your point here.  My point was that this was already a bug;
the caller shouldn't simply be clearing __GFP_DMA; they really mean to clear
all of the GFP_ZONE bits so that they allocate from ZONE_NORMAL.  And for
that, they should be using ~GFP_ZONEMASK

Unless they already know, of course.  For example, this one in
arch/x86/mm/pgtable.c is fine:

        if (strcmp(arg, "nohigh") == 0)
                __userpte_alloc_gfp &= ~__GFP_HIGHMEM;

because it knows that __userpte_alloc_gfp can only have __GFP_HIGHMEM set.

But something like btrfs should almost certainly be using ~GFP_ZONEMASK.

> > +#define __GFP_HIGHMEM  ((__force gfp_t)OPT_ZONE_HIGHMEM ^
> > ZONE_NORMAL)
> > -#define __GFP_MOVABLE  ((__force gfp_t)___GFP_MOVABLE)  /*
> > ZONE_MOVABLE allowed */
> > +#define __GFP_MOVABLE  ((__force gfp_t)ZONE_MOVABLE ^
> > ZONE_NORMAL | \
> > +					___GFP_MOVABLE)
> > 
> > Then I think you can just make it:
> > 
> > static inline enum zone_type gfp_zone(gfp_t flags)
> > {
> > 	return ((__force int)flags & ___GFP_ZONE_MASK) ^ ZONE_NORMAL;
> > }
> Sorry, I think it has risk in this way, let me introduce a failure case for example.
> 
> Now suppose that, there is a flag should represent DMA flag with movable.
> It should be like this below,
> __GFP_DMA | __GFP_MOVABLE
> (0b 0010       |   0b 1001   = 0b 1011)
> Normally, gfp_zone shall return ZONE_DMA but with MOVABLE policy, right?

No, if you somehow end up with __GFP_MOVABLE | __GFP_DMA, it should give you
ZONE_DMA.

> But with your code, gfp_zone will return ZONE_DMA32 with MOVABLE policy.
> (0b 1011  ^  0b 0010 = 1001)

___GFP_ZONE_MASK is 0x7, so it excludes __GFP_MOVABLE.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [External]  Re: [PATCH 2/3] include/linux/gfp.h: use unsigned int in gfp_zone
  2018-05-07 18:44                 ` Matthew Wilcox
@ 2018-05-07 21:25                   ` David Sterba
  2018-05-08  0:25                     ` Matthew Wilcox
  2018-05-09 14:57                     ` Huaisheng HS1 Ye
  2018-05-08  0:25                   ` Huaisheng HS1 Ye
  1 sibling, 2 replies; 20+ messages in thread
From: David Sterba @ 2018-05-07 21:25 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Huaisheng HS1 Ye, Michal Hocko, akpm, linux-mm, vbabka, mgorman,
	pasha.tatashin, alexander.levin, hannes, penguin-kernel, colyli,
	NingTing Cheng, linux-kernel

On Mon, May 07, 2018 at 11:44:10AM -0700, Matthew Wilcox wrote:
> On Mon, May 07, 2018 at 05:16:50PM +0000, Huaisheng HS1 Ye wrote:
> > I hope it couldn't cause problem, but based on my analyzation it has the potential to go wrong if users still use the flags as usual, which are __GFP_DMA, __GFP_DMA32 and __GFP_HIGHMEM.
> > Let me take an example with my testing platform, these logics are much abstract, an example will be helpful.
> > 
> > There is a two sockets X86_64 server, No HIGHMEM and it has 16 + 16GB memories.
> > Its zone types shall be like this below,
> > 
> > ZONE_DMA		0		0b0000
> > ZONE_DMA32		1		0b0001
> > ZONE_NORMAL		2		0b0010
> > (OPT_ZONE_HIGHMEM)	2		0b0010
> > ZONE_MOVABLE		3		0b0011
> > ZONE_DEVICE		4		0b0100 (virtual zone)
> > __MAX_NR_ZONES	5
> > 
> > __GFP_DMA	= ZONE_DMA ^ ZONE_NORMAL= 0b0010
> > __GFP_DMA32	= ZONE_DMA32 ^ ZONE_NORMAL= 0b0011
> > __GFP_HIGHMEM = OPT_ZONE_HIGHMEM ^ ZONE_NORMAL = 0b0000
> > __GFP_MOVABLE	= ZONE_MOVABLE ^ ZONE_NORMAL | ___GFP_MOVABLE = 0b1001
> > 
> > Eg.
> > If a driver uses flags like this below,
> > Step 1:
> > gfp_mask  |  __GFP_DMA32;	
> > (0b 0000		|	0b 0011	= 0b 0011)
> > gfp_mask's low four bits shall equal to 0011, assuming no __GFP_MOVABLE
> > 
> > Step 2:
> > gfp_mask  & ~__GFP_DMA;	
> > (0b 0011	 & ~0b0010   = 0b0001)
> > gfp_mask's low four bits shall equal to 0001 now, then when it enter gfp_zone(),
> > 
> > return ((__force int)flags & ___GFP_ZONE_MASK) ^ ZONE_NORMAL;
> > (0b0001 ^ 0b0010 = 0b0011)
> > You know 0011 means that ZONE_MOVABLE will be returned.
> > In this case, error can be found, because gfp_mask needs to get ZONE_DMA32 originally.
> > But with existing GFP_ZONE_TABLE/BAD, it is correct. Because the bits are way of 0x1, 0x2, 0x4, 0x8
> 
> Yes, I understand your point here.  My point was that this was already a bug;
> the caller shouldn't simply be clearing __GFP_DMA; they really mean to clear
> all of the GFP_ZONE bits so that they allocate from ZONE_NORMAL.  And for
> that, they should be using ~GFP_ZONEMASK
> 
> Unless they already know, of course.  For example, this one in
> arch/x86/mm/pgtable.c is fine:
> 
>         if (strcmp(arg, "nohigh") == 0)
>                 __userpte_alloc_gfp &= ~__GFP_HIGHMEM;
> 
> because it knows that __userpte_alloc_gfp can only have __GFP_HIGHMEM set.
> 
> But something like btrfs should almost certainly be using ~GFP_ZONEMASK.

Agreed, the direct use of __GFP_DMA32 was added in 3ba7ab220e8918176c6f
to substitute GFP_NOFS, so the allocation flags are less restrictive but
still acceptable for allocation from slab.

The requirement from btrfs is to avoid highmem, the 'must be acceptable
for slab' requirement is more MM internal and should have been hidden
under some opaque flag mask. There was no strong need for that at the
time.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [External]  Re: [PATCH 2/3] include/linux/gfp.h: use unsigned int in gfp_zone
  2018-05-07 18:44                 ` Matthew Wilcox
  2018-05-07 21:25                   ` David Sterba
@ 2018-05-08  0:25                   ` Huaisheng HS1 Ye
  1 sibling, 0 replies; 20+ messages in thread
From: Huaisheng HS1 Ye @ 2018-05-08  0:25 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Michal Hocko, akpm, linux-mm, vbabka, mgorman, pasha.tatashin,
	alexander.levin, hannes, penguin-kernel, colyli, NingTing Cheng,
	linux-kernel


> On Mon, May 07, 2018 at 05:16:50PM +0000, Huaisheng HS1 Ye wrote:
> > I hope it couldn't cause problem, but based on my analyzation it has the
> potential to go wrong if users still use the flags as usual, which are __GFP_DMA,
> __GFP_DMA32 and __GFP_HIGHMEM.
> > Let me take an example with my testing platform, these logics are much
> abstract, an example will be helpful.
> >
> > There is a two sockets X86_64 server, No HIGHMEM and it has 16 + 16GB
> memories.
> > Its zone types shall be like this below,
> >
> > ZONE_DMA		0		0b0000
> > ZONE_DMA32		1		0b0001
> > ZONE_NORMAL		2		0b0010
> > (OPT_ZONE_HIGHMEM)	2		0b0010
> > ZONE_MOVABLE		3		0b0011
> > ZONE_DEVICE		4		0b0100 (virtual zone)
> > __MAX_NR_ZONES	5
> >
> > __GFP_DMA	= ZONE_DMA ^ ZONE_NORMAL= 0b0010
> > __GFP_DMA32	= ZONE_DMA32 ^ ZONE_NORMAL= 0b0011
> > __GFP_HIGHMEM = OPT_ZONE_HIGHMEM ^ ZONE_NORMAL = 0b0000
> > __GFP_MOVABLE	= ZONE_MOVABLE ^ ZONE_NORMAL |
> ___GFP_MOVABLE = 0b1001
> >
> > Eg.
> > If a driver uses flags like this below,
> > Step 1:
> > gfp_mask  |  __GFP_DMA32;
> > (0b 0000		|	0b 0011	= 0b 0011)
> > gfp_mask's low four bits shall equal to 0011, assuming no __GFP_MOVABLE
> >
> > Step 2:
> > gfp_mask  & ~__GFP_DMA;
> > (0b 0011	 & ~0b0010   = 0b0001)
> > gfp_mask's low four bits shall equal to 0001 now, then when it enter
> gfp_zone(),
> >
> > return ((__force int)flags & ___GFP_ZONE_MASK) ^ ZONE_NORMAL;
> > (0b0001 ^ 0b0010 = 0b0011)
> > You know 0011 means that ZONE_MOVABLE will be returned.
> > In this case, error can be found, because gfp_mask needs to get
> ZONE_DMA32 originally.
> > But with existing GFP_ZONE_TABLE/BAD, it is correct. Because the bits are
> way of 0x1, 0x2, 0x4, 0x8
> 
> Yes, I understand your point here.  My point was that this was already a bug;
> the caller shouldn't simply be clearing __GFP_DMA; they really mean to clear
> all of the GFP_ZONE bits so that they allocate from ZONE_NORMAL.  And for
> that, they should be using ~GFP_ZONEMASK
That is great, if they can follow this principle, I don't worry it. Maybe I am too cautious.

> 
> Unless they already know, of course.  For example, this one in
> arch/x86/mm/pgtable.c is fine:
> 
>         if (strcmp(arg, "nohigh") == 0)
>                 __userpte_alloc_gfp &= ~__GFP_HIGHMEM;
> 
> because it knows that __userpte_alloc_gfp can only have __GFP_HIGHMEM set.
> 
> But something like btrfs should almost certainly be using ~GFP_ZONEMASK.


> > > +#define __GFP_HIGHMEM  ((__force gfp_t)OPT_ZONE_HIGHMEM ^
> > > ZONE_NORMAL)
> > > -#define __GFP_MOVABLE  ((__force gfp_t)___GFP_MOVABLE)  /*
> > > ZONE_MOVABLE allowed */
> > > +#define __GFP_MOVABLE  ((__force gfp_t)ZONE_MOVABLE ^
> > > ZONE_NORMAL | \
> > > +					___GFP_MOVABLE)
> > >
> > > Then I think you can just make it:
> > >
> > > static inline enum zone_type gfp_zone(gfp_t flags)
> > > {
> > > 	return ((__force int)flags & ___GFP_ZONE_MASK) ^ ZONE_NORMAL;
> > > }
> > Sorry, I think it has risk in this way, let me introduce a failure case for
> example.
> >
> > Now suppose that, there is a flag should represent DMA flag with movable.
> > It should be like this below,
> > __GFP_DMA | __GFP_MOVABLE
> > (0b 0010       |   0b 1001   = 0b 1011)
> > Normally, gfp_zone shall return ZONE_DMA but with MOVABLE policy, right?
> 
> No, if you somehow end up with __GFP_MOVABLE | __GFP_DMA, it should give
> you ZONE_DMA.
Exactly, it should return ZONE_DMA, that's what I thought.

> 
> > But with your code, gfp_zone will return ZONE_DMA32 with MOVABLE
> >policy.
> > (0b 1011  ^  0b 0010 = 1001)
> 
> ___GFP_ZONE_MASK is 0x7, so it excludes __GFP_MOVABLE.
Sorry, I made a mistake here. I rewrite it as below.

((__GFP_DMA | __GFP_MOVABLE) & ___GFP_ZONE_MASK)
   ((0b 0010  |  0b 1001  = 0b 1011) & 0b 0111)	= 0b 0011

0b 0011 ^ 0b 0010 = 0b 0001
So ZONE_DMA32 will be returned, but what user needs is ZONE_DMA.

Thanks,
Huaisheng

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [External]  Re: [PATCH 2/3] include/linux/gfp.h: use unsigned int in gfp_zone
  2018-05-07 21:25                   ` David Sterba
@ 2018-05-08  0:25                     ` Matthew Wilcox
  2018-05-09  9:36                       ` David Sterba
  2018-05-09 14:57                     ` Huaisheng HS1 Ye
  1 sibling, 1 reply; 20+ messages in thread
From: Matthew Wilcox @ 2018-05-08  0:25 UTC (permalink / raw)
  To: dsterba, Huaisheng HS1 Ye, Michal Hocko, akpm, linux-mm, vbabka,
	mgorman, pasha.tatashin, alexander.levin, hannes, penguin-kernel,
	colyli, NingTing Cheng, linux-kernel

On Mon, May 07, 2018 at 11:25:01PM +0200, David Sterba wrote:
> On Mon, May 07, 2018 at 11:44:10AM -0700, Matthew Wilcox wrote:
> > But something like btrfs should almost certainly be using ~GFP_ZONEMASK.
> 
> Agreed, the direct use of __GFP_DMA32 was added in 3ba7ab220e8918176c6f
> to substitute GFP_NOFS, so the allocation flags are less restrictive but
> still acceptable for allocation from slab.
> 
> The requirement from btrfs is to avoid highmem, the 'must be acceptable
> for slab' requirement is more MM internal and should have been hidden
> under some opaque flag mask. There was no strong need for that at the
> time.

The GFP flags encode a multiple of different requirements.  There's
"What can the allocator do to free memory" and "what area of memory
can the allocation come from".  btrfs doesn't actually want to
allocate memory from ZONE_MOVABLE or ZONE_DMA either.  It's probably never
been called with those particular flags set, but in the spirit of
future-proofing btrfs, perhaps a patch like this is in order?

---- >8 ----

Subject: btrfs: Allocate extents from ZONE_NORMAL
From: Matthew Wilcox <mawilcox@microsoft.com>

If anyone ever passes a GFP_DMA or GFP_MOVABLE allocation flag to
allocate_extent_state, it will try to allocate memory from the wrong zone.
We just want to allocate memory from ZONE_NORMAL, so use GFP_RECLAIM_MASK
to get what we want.

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index e99b329002cf..4e4a67b7b29d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -216,12 +216,7 @@ static struct extent_state *alloc_extent_state(gfp_t mask)
 {
 	struct extent_state *state;
 
-	/*
-	 * The given mask might be not appropriate for the slab allocator,
-	 * drop the unsupported bits
-	 */
-	mask &= ~(__GFP_DMA32|__GFP_HIGHMEM);
-	state = kmem_cache_alloc(extent_state_cache, mask);
+	state = kmem_cache_alloc(extent_state_cache, mask & GFP_RECLAIM_MASK);
 	if (!state)
 		return state;
 	state->state = 0;

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [External]  Re: [PATCH 3/3] mm/page_alloc: Fix typo in debug info of calculate_node_totalpages
  2018-05-05  2:10     ` [External] " Huaisheng HS1 Ye
@ 2018-05-09  7:48       ` Michal Hocko
  0 siblings, 0 replies; 20+ messages in thread
From: Michal Hocko @ 2018-05-09  7:48 UTC (permalink / raw)
  To: Huaisheng HS1 Ye
  Cc: akpm, linux-mm, vbabka, mgorman, pasha.tatashin, alexander.levin,
	hannes, penguin-kernel, colyli, NingTing Cheng, linux-kernel

On Sat 05-05-18 02:10:35, Huaisheng HS1 Ye wrote:
[...]
> But this printk is a relatively meaningful reference within dmesg log.
> Especially for people who doesn't have much experience, or someone
> has a plan to modify boundary of zones within free_area_init_*.

Could you be more specific please? I am not saying that the printk is
pointless but it is DEBUG and as such it doesn't give us a very good
picture.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [External]  Re: [PATCH 2/3] include/linux/gfp.h: use unsigned int in gfp_zone
  2018-05-08  0:25                     ` Matthew Wilcox
@ 2018-05-09  9:36                       ` David Sterba
  2018-05-15 11:54                         ` Matthew Wilcox
  0 siblings, 1 reply; 20+ messages in thread
From: David Sterba @ 2018-05-09  9:36 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: dsterba, Huaisheng HS1 Ye, Michal Hocko, akpm, linux-mm, vbabka,
	mgorman, pasha.tatashin, alexander.levin, hannes, penguin-kernel,
	colyli, NingTing Cheng, linux-kernel

On Mon, May 07, 2018 at 05:25:47PM -0700, Matthew Wilcox wrote:
> On Mon, May 07, 2018 at 11:25:01PM +0200, David Sterba wrote:
> > On Mon, May 07, 2018 at 11:44:10AM -0700, Matthew Wilcox wrote:
> > > But something like btrfs should almost certainly be using ~GFP_ZONEMASK.
> > 
> > Agreed, the direct use of __GFP_DMA32 was added in 3ba7ab220e8918176c6f
> > to substitute GFP_NOFS, so the allocation flags are less restrictive but
> > still acceptable for allocation from slab.
> > 
> > The requirement from btrfs is to avoid highmem, the 'must be acceptable
> > for slab' requirement is more MM internal and should have been hidden
> > under some opaque flag mask. There was no strong need for that at the
> > time.
> 
> The GFP flags encode a multiple of different requirements.  There's
> "What can the allocator do to free memory" and "what area of memory
> can the allocation come from".  btrfs doesn't actually want to
> allocate memory from ZONE_MOVABLE or ZONE_DMA either.  It's probably never
> been called with those particular flags set, but in the spirit of
> future-proofing btrfs, perhaps a patch like this is in order?
> 
> ---- >8 ----
> 
> Subject: btrfs: Allocate extents from ZONE_NORMAL
> From: Matthew Wilcox <mawilcox@microsoft.com>
> 
> If anyone ever passes a GFP_DMA or GFP_MOVABLE allocation flag to
> allocate_extent_state, it will try to allocate memory from the wrong zone.
> We just want to allocate memory from ZONE_NORMAL, so use GFP_RECLAIM_MASK
> to get what we want.

Looks good to me.

> Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index e99b329002cf..4e4a67b7b29d 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -216,12 +216,7 @@ static struct extent_state *alloc_extent_state(gfp_t mask)
>  {
>  	struct extent_state *state;
>  
> -	/*
> -	 * The given mask might be not appropriate for the slab allocator,
> -	 * drop the unsupported bits
> -	 */
> -	mask &= ~(__GFP_DMA32|__GFP_HIGHMEM);

I've noticed there's GFP_SLAB_BUG_MASK that's basically open coded here,
but this would not filter out the placement flags.

> -	state = kmem_cache_alloc(extent_state_cache, mask);

I'd prefer some comment here, it's not obvious why the mask is used.

> +	state = kmem_cache_alloc(extent_state_cache, mask & GFP_RECLAIM_MASK);
>  	if (!state)
>  		return state;
>  	state->state = 0;

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [External]  Re: [PATCH 2/3] include/linux/gfp.h: use unsigned int in gfp_zone
  2018-05-07 21:25                   ` David Sterba
  2018-05-08  0:25                     ` Matthew Wilcox
@ 2018-05-09 14:57                     ` Huaisheng HS1 Ye
  1 sibling, 0 replies; 20+ messages in thread
From: Huaisheng HS1 Ye @ 2018-05-09 14:57 UTC (permalink / raw)
  To: Matthew Wilcox, dsterba
  Cc: Michal Hocko, akpm, linux-mm, vbabka, mgorman, pasha.tatashin,
	alexander.levin, hannes, penguin-kernel, colyli, NingTing Cheng,
	linux-kernel


> On Mon, May 07, 2018 at 11:44:10AM -0700, Matthew Wilcox wrote:
> > On Mon, May 07, 2018 at 05:16:50PM +0000, Huaisheng HS1 Ye wrote:
> > > I hope it couldn't cause problem, but based on my analyzation it has the potential
> to go wrong if users still use the flags as usual, which are __GFP_DMA, __GFP_DMA32
> and __GFP_HIGHMEM.
> > > Let me take an example with my testing platform, these logics are much abstract,
> an example will be helpful.
> > >
> > > There is a two sockets X86_64 server, No HIGHMEM and it has 16 + 16GB memories.
> > > Its zone types shall be like this below,
> > >
> > > ZONE_DMA		0		0b0000
> > > ZONE_DMA32		1		0b0001
> > > ZONE_NORMAL		2		0b0010
> > > (OPT_ZONE_HIGHMEM)	2		0b0010
> > > ZONE_MOVABLE		3		0b0011
> > > ZONE_DEVICE		4		0b0100 (virtual zone)
> > > __MAX_NR_ZONES	5
> > >
> > > __GFP_DMA	= ZONE_DMA ^ ZONE_NORMAL= 0b0010
> > > __GFP_DMA32	= ZONE_DMA32 ^ ZONE_NORMAL= 0b0011
> > > __GFP_HIGHMEM = OPT_ZONE_HIGHMEM ^ ZONE_NORMAL = 0b0000
> > > __GFP_MOVABLE	= ZONE_MOVABLE ^ ZONE_NORMAL | ___GFP_MOVABLE = 0b1001
> > >
> > > Eg.
> > > If a driver uses flags like this below,
> > > Step 1:
> > > gfp_mask  |  __GFP_DMA32;
> > > (0b 0000		|	0b 0011	= 0b 0011)
> > > gfp_mask's low four bits shall equal to 0011, assuming no __GFP_MOVABLE
> > >
> > > Step 2:
> > > gfp_mask  & ~__GFP_DMA;
> > > (0b 0011	 & ~0b0010   = 0b0001)
> > > gfp_mask's low four bits shall equal to 0001 now, then when it enter gfp_zone(),
> > >
> > > return ((__force int)flags & ___GFP_ZONE_MASK) ^ ZONE_NORMAL;
> > > (0b0001 ^ 0b0010 = 0b0011)
> > > You know 0011 means that ZONE_MOVABLE will be returned.
> > > In this case, error can be found, because gfp_mask needs to get ZONE_DMA32 originally.
> > > But with existing GFP_ZONE_TABLE/BAD, it is correct. Because the bits are way of
> 0x1, 0x2, 0x4, 0x8
> >
> > Yes, I understand your point here.  My point was that this was already a bug;
> > the caller shouldn't simply be clearing __GFP_DMA; they really mean to clear
> > all of the GFP_ZONE bits so that they allocate from ZONE_NORMAL.  And for
> > that, they should be using ~GFP_ZONEMASK
> >
> > Unless they already know, of course.  For example, this one in
> > arch/x86/mm/pgtable.c is fine:
> >
> >         if (strcmp(arg, "nohigh") == 0)
> >                 __userpte_alloc_gfp &= ~__GFP_HIGHMEM;
> >
> > because it knows that __userpte_alloc_gfp can only have __GFP_HIGHMEM set.
> >
> > But something like btrfs should almost certainly be using ~GFP_ZONEMASK.
> 
> Agreed, the direct use of __GFP_DMA32 was added in 3ba7ab220e8918176c6f
> to substitute GFP_NOFS, so the allocation flags are less restrictive but
> still acceptable for allocation from slab.
> 
> The requirement from btrfs is to avoid highmem, the 'must be acceptable
> for slab' requirement is more MM internal and should have been hidden
> under some opaque flag mask. There was no strong need for that at the
> time.

Hi Matthew,

Should we add an error detection in gfp_zone? How about this?

@@ -377,6 +377,8 @@ static inline enum zone_type gfp_zone(gfp_t flags)
                z = OPT_ZONE_HIGHMEM +
                        !!((__force unsigned int)flags & ___GFP_MOVABLE);
        }
+
+       VM_BUG_ON(z > ZONE_MOVABLE);
        return z;
 }


Sincerely,
Huaisheng Ye

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [External]  Re: [PATCH 2/3] include/linux/gfp.h: use unsigned int in gfp_zone
  2018-05-09  9:36                       ` David Sterba
@ 2018-05-15 11:54                         ` Matthew Wilcox
  2018-05-21 17:06                           ` David Sterba
  0 siblings, 1 reply; 20+ messages in thread
From: Matthew Wilcox @ 2018-05-15 11:54 UTC (permalink / raw)
  To: dsterba, Huaisheng HS1 Ye, Michal Hocko, akpm, linux-mm, vbabka,
	mgorman, pasha.tatashin, alexander.levin, hannes, penguin-kernel,
	colyli, NingTing Cheng, linux-kernel

On Wed, May 09, 2018 at 11:36:59AM +0200, David Sterba wrote:
> On Mon, May 07, 2018 at 05:25:47PM -0700, Matthew Wilcox wrote:
> > On Mon, May 07, 2018 at 11:25:01PM +0200, David Sterba wrote:
> > > On Mon, May 07, 2018 at 11:44:10AM -0700, Matthew Wilcox wrote:
> > > > But something like btrfs should almost certainly be using ~GFP_ZONEMASK.
> > > 
> > > Agreed, the direct use of __GFP_DMA32 was added in 3ba7ab220e8918176c6f
> > > to substitute GFP_NOFS, so the allocation flags are less restrictive but
> > > still acceptable for allocation from slab.
> > > 
> > > The requirement from btrfs is to avoid highmem, the 'must be acceptable
> > > for slab' requirement is more MM internal and should have been hidden
> > > under some opaque flag mask. There was no strong need for that at the
> > > time.
> > 
> > The GFP flags encode a multiple of different requirements.  There's
> > "What can the allocator do to free memory" and "what area of memory
> > can the allocation come from".  btrfs doesn't actually want to
> > allocate memory from ZONE_MOVABLE or ZONE_DMA either.  It's probably never
> > been called with those particular flags set, but in the spirit of
> > future-proofing btrfs, perhaps a patch like this is in order?
> > 
> > ---- >8 ----
> > 
> > Subject: btrfs: Allocate extents from ZONE_NORMAL
> > From: Matthew Wilcox <mawilcox@microsoft.com>
> > 
> > If anyone ever passes a GFP_DMA or GFP_MOVABLE allocation flag to
> > allocate_extent_state, it will try to allocate memory from the wrong zone.
> > We just want to allocate memory from ZONE_NORMAL, so use GFP_RECLAIM_MASK
> > to get what we want.
> 
> Looks good to me.
> 
> > Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
> > 
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index e99b329002cf..4e4a67b7b29d 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -216,12 +216,7 @@ static struct extent_state *alloc_extent_state(gfp_t mask)
> >  {
> >  	struct extent_state *state;
> >  
> > -	/*
> > -	 * The given mask might be not appropriate for the slab allocator,
> > -	 * drop the unsupported bits
> > -	 */
> > -	mask &= ~(__GFP_DMA32|__GFP_HIGHMEM);
> 
> I've noticed there's GFP_SLAB_BUG_MASK that's basically open coded here,
> but this would not filter out the placement flags.
> 
> > -	state = kmem_cache_alloc(extent_state_cache, mask);
> 
> I'd prefer some comment here, it's not obvious why the mask is used.

Sorry, I dropped the ball on this.  Would you prefer:

        /* Allocate from ZONE_NORMAL */
        state = kmem_cache_alloc(extent_state_cache, mask & GFP_RECLAIM_MASK);

or

	/*
	 * Callers may pass in a mask which indicates they want to allocate
	 * from a special zone, so clear those bits here rather than forcing
	 * each caller to do it.  We only want to use their mask to indicate
	 * what strategies the memory allocator can use to free memory.
	 */
        state = kmem_cache_alloc(extent_state_cache, mask & GFP_RECLAIM_MASK);

I tend to lean towards being more terse, but it's not about me, it's
about whoever reads this code next.

> > +	state = kmem_cache_alloc(extent_state_cache, mask & GFP_RECLAIM_MASK);
> >  	if (!state)
> >  		return state;
> >  	state->state = 0;
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [External]  Re: [PATCH 2/3] include/linux/gfp.h: use unsigned int in gfp_zone
  2018-05-15 11:54                         ` Matthew Wilcox
@ 2018-05-21 17:06                           ` David Sterba
  0 siblings, 0 replies; 20+ messages in thread
From: David Sterba @ 2018-05-21 17:06 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Huaisheng HS1 Ye, Michal Hocko, akpm, linux-mm, vbabka, mgorman,
	pasha.tatashin, alexander.levin, hannes, penguin-kernel, colyli,
	NingTing Cheng, linux-kernel

On Tue, May 15, 2018 at 04:54:04AM -0700, Matthew Wilcox wrote:
> > > Subject: btrfs: Allocate extents from ZONE_NORMAL
> > > From: Matthew Wilcox <mawilcox@microsoft.com>
> > > 
> > > If anyone ever passes a GFP_DMA or GFP_MOVABLE allocation flag to
> > > allocate_extent_state, it will try to allocate memory from the wrong zone.
> > > We just want to allocate memory from ZONE_NORMAL, so use GFP_RECLAIM_MASK
> > > to get what we want.
> > 
> > Looks good to me.
> > 
> > > Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
> > > 
> > > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > > index e99b329002cf..4e4a67b7b29d 100644
> > > --- a/fs/btrfs/extent_io.c
> > > +++ b/fs/btrfs/extent_io.c
> > > @@ -216,12 +216,7 @@ static struct extent_state *alloc_extent_state(gfp_t mask)
> > >  {
> > >  	struct extent_state *state;
> > >  
> > > -	/*
> > > -	 * The given mask might be not appropriate for the slab allocator,
> > > -	 * drop the unsupported bits
> > > -	 */
> > > -	mask &= ~(__GFP_DMA32|__GFP_HIGHMEM);
> > 
> > I've noticed there's GFP_SLAB_BUG_MASK that's basically open coded here,
> > but this would not filter out the placement flags.
> > 
> > > -	state = kmem_cache_alloc(extent_state_cache, mask);
> > 
> > I'd prefer some comment here, it's not obvious why the mask is used.
> 
> Sorry, I dropped the ball on this.  Would you prefer:
> 
>         /* Allocate from ZONE_NORMAL */
>         state = kmem_cache_alloc(extent_state_cache, mask & GFP_RECLAIM_MASK);
> 
> or
> 
> 	/*
> 	 * Callers may pass in a mask which indicates they want to allocate
> 	 * from a special zone, so clear those bits here rather than forcing
> 	 * each caller to do it.  We only want to use their mask to indicate
> 	 * what strategies the memory allocator can use to free memory.
> 	 */
>         state = kmem_cache_alloc(extent_state_cache, mask & GFP_RECLAIM_MASK);
> 
> I tend to lean towards being more terse, but it's not about me, it's
> about whoever reads this code next.

I prefer the latter variant, it's clear that it's some MM stuff. Thanks.

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2018-05-21 17:09 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-05-04  6:52 [PATCH 0/3] Some fixes for mm code optimization Huaisheng Ye
     [not found] ` <1525416729-108201-4-git-send-email-yehs1@lenovo.com>
2018-05-04 13:18   ` [PATCH 3/3] mm/page_alloc: Fix typo in debug info of calculate_node_totalpages Michal Hocko
2018-05-05  2:10     ` [External] " Huaisheng HS1 Ye
2018-05-09  7:48       ` Michal Hocko
     [not found] ` <1525416729-108201-3-git-send-email-yehs1@lenovo.com>
2018-05-04 13:35   ` [PATCH 2/3] include/linux/gfp.h: use unsigned int in gfp_zone Michal Hocko
2018-05-04 15:40     ` Matthew Wilcox
2018-05-04 17:50       ` [External] " Huaisheng HS1 Ye
2018-05-06  9:32       ` Huaisheng HS1 Ye
2018-05-06 13:48         ` Matthew Wilcox
2018-05-06 16:17           ` Huaisheng HS1 Ye
2018-05-06 18:55             ` Matthew Wilcox
2018-05-07 17:16               ` Huaisheng HS1 Ye
2018-05-07 18:44                 ` Matthew Wilcox
2018-05-07 21:25                   ` David Sterba
2018-05-08  0:25                     ` Matthew Wilcox
2018-05-09  9:36                       ` David Sterba
2018-05-15 11:54                         ` Matthew Wilcox
2018-05-21 17:06                           ` David Sterba
2018-05-09 14:57                     ` Huaisheng HS1 Ye
2018-05-08  0:25                   ` Huaisheng HS1 Ye

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.