All of lore.kernel.org
 help / color / mirror / Atom feed
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
To: mhocko@suse.cz, akpm@linux-foundation.org
Cc: hannes@cmpxchg.org, david@fromorbit.com, mgorman@suse.de,
	riel@redhat.com, fengguang.wu@intel.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH 0/2] Move away from non-failing small allocations
Date: Tue, 17 Mar 2015 23:06:34 +0900	[thread overview]
Message-ID: <201503172305.DIH52162.FOFMFOVJHLOtQS@I-love.SAKURA.ne.jp> (raw)
In-Reply-To: <20150317090738.GB28112@dhcp22.suse.cz>

Michal Hocko wrote:
> On Mon 16-03-15 15:38:43, Andrew Morton wrote:
> > Realistically, I don't think this overall effort will be successful -
> > we'll add the knob, it won't get enough testing and any attempt to
> > alter the default will be us deliberately destabilizing the kernel
> > without knowing how badly :(
> 
> Without the knob we do not allow users to test this at all though and
> the transition will _never_ happen. Which is IMHO bad.
> 

Even with the knob, quite little users will test this. The consequence is
likely that end users rush into customer support center about obscure bugs.
I'm working at a support center, and such bugs are really annoying.

> > I wonder if we can alter the behaviour only for filesystem code, so we
> > constrain the new behaviour just to that code where we're having
> > problems.  Most/all fs code goes via vfs methods so there's a reasonably
> > small set of places where we can call
> 
> We are seeing issues with the fs code now because the test cases which
> led to the current discussion exercise FS code. The code which does
> lock(); kmalloc(GFP_KERNEL) is not reduced there though. I am pretty sure
> we can find other subsystems if we try hard enough.

I'm expecting for patches which avoids deadlock by lock(); kmalloc(GFP_KERNEL).

> > static inline void enter_fs_code(struct super_block *sb)
> > {
> > 	if (sb->my_small_allocations_can_fail)
> > 		current->small_allocations_can_fail++;
> > }
> > 
> > that way (or something similar) we can select the behaviour on a per-fs
> > basis and the rest of the kernel remains unaffected.  Other subsystems
> > can opt in as well.
> 
> This is basically leading to GFP_MAYFAIL which is completely backwards
> (the hard requirement should be an exception not a default rule).
> I really do not want to end up with stuffing random may_fail annotations
> all over the kernel.
> 

I wish that GFP_NOFS / GFP_NOIO regions are annotated with

  static inline void enter_fs_code(void)
  {
  #ifdef CONFIG_DEBUG_GFP_FLAGS
  	current->in_fs_code++;
  #endif
  }

  static inline void leave_fs_code(void)
  {
  #ifdef CONFIG_DEBUG_GFP_FLAGS
  	current->in_fs_code--;
  #endif
  }

  static inline void enter_io_code(void)
  {
  #ifdef CONFIG_DEBUG_GFP_FLAGS
  	current->in_io_code++;
  #endif
  }

  static inline void leave_io_code(void)
  {
  #ifdef CONFIG_DEBUG_GFP_FLAGS
  	current->in_io_code--;
  #endif
  }

so that inappropriate GFP_KERNEL usage inside GFP_NOFS region are catchable
by doing

  struct page *
  __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
                          struct zonelist *zonelist, nodemask_t *nodemask)
  {
  	struct zoneref *preferred_zoneref;
  	struct page *page = NULL;
  	unsigned int cpuset_mems_cookie;
  	int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_FAIR;
  	gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
  	struct alloc_context ac = {
  		.high_zoneidx = gfp_zone(gfp_mask),
  		.nodemask = nodemask,
  		.migratetype = gfpflags_to_migratetype(gfp_mask),
  	};
  	
  	gfp_mask &= gfp_allowed_mask;
 +#ifdef CONFIG_DEBUG_GFP_FLAGS
 +	WARN_ON(current->in_fs_code & (gfp_mask & __GFP_FS));
 +	WARN_ON(current->in_io_code & (gfp_mask & __GFP_IO));
 +#endif
  
  	lockdep_trace_alloc(gfp_mask);
  

. It is difficult for non-fs developers to determine whether they need to use
GFP_NOFS than GFP_KERNEL in their code. An example is seen at
http://marc.info/?l=linux-security-module&m=138556479607024&w=2 .

Moreover, I don't know how GFP flags are managed when stacked like
"a swap file on ext4 on top of LVM (with snapshots) on a RAID array
connected over iSCSI" (quoted from comments on Jon's writeup), but I
wish that the distinction between GFP_KERNEL / GFP_NOFS / GFP_NOIO
are removed from memory allocating function callers by doing

  static inline void enter_fs_code(void)
  {
  	current->in_fs_code++;
  }

  static inline void leave_fs_code(void)
  {
  	current->in_fs_code--;
  }

  static inline void enter_io_code(void)
  {
  	current->in_io_code++;
  }

  static inline void leave_io_code(void)
  {
  	current->in_io_code--;
  }

  struct page *
  __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
                          struct zonelist *zonelist, nodemask_t *nodemask)
  {
  	struct zoneref *preferred_zoneref;
  	struct page *page = NULL;
  	unsigned int cpuset_mems_cookie;
  	int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_FAIR;
  	gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
  	struct alloc_context ac = {
  		.high_zoneidx = gfp_zone(gfp_mask),
  		.nodemask = nodemask,
  		.migratetype = gfpflags_to_migratetype(gfp_mask),
  	};
  	
  	gfp_mask &= gfp_allowed_mask;
 +	if (current->in_fs_code)
 +		gfp_mask &= ~__GFP_FS;
 +	if (current->in_io_code)
 +		gfp_mask &= ~__GFP_IO;
  
  	lockdep_trace_alloc(gfp_mask);
  

so that GFP flags passed to memory allocations involved by stacking
will be appropriately masked.

WARNING: multiple messages have this Message-ID (diff)
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
To: mhocko@suse.cz, akpm@linux-foundation.org
Cc: hannes@cmpxchg.org, david@fromorbit.com, mgorman@suse.de,
	riel@redhat.com, fengguang.wu@intel.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH 0/2] Move away from non-failing small allocations
Date: Tue, 17 Mar 2015 23:06:34 +0900	[thread overview]
Message-ID: <201503172305.DIH52162.FOFMFOVJHLOtQS@I-love.SAKURA.ne.jp> (raw)
In-Reply-To: <20150317090738.GB28112@dhcp22.suse.cz>

Michal Hocko wrote:
> On Mon 16-03-15 15:38:43, Andrew Morton wrote:
> > Realistically, I don't think this overall effort will be successful -
> > we'll add the knob, it won't get enough testing and any attempt to
> > alter the default will be us deliberately destabilizing the kernel
> > without knowing how badly :(
> 
> Without the knob we do not allow users to test this at all though and
> the transition will _never_ happen. Which is IMHO bad.
> 

Even with the knob, quite little users will test this. The consequence is
likely that end users rush into customer support center about obscure bugs.
I'm working at a support center, and such bugs are really annoying.

> > I wonder if we can alter the behaviour only for filesystem code, so we
> > constrain the new behaviour just to that code where we're having
> > problems.  Most/all fs code goes via vfs methods so there's a reasonably
> > small set of places where we can call
> 
> We are seeing issues with the fs code now because the test cases which
> led to the current discussion exercise FS code. The code which does
> lock(); kmalloc(GFP_KERNEL) is not reduced there though. I am pretty sure
> we can find other subsystems if we try hard enough.

I'm expecting for patches which avoids deadlock by lock(); kmalloc(GFP_KERNEL).

> > static inline void enter_fs_code(struct super_block *sb)
> > {
> > 	if (sb->my_small_allocations_can_fail)
> > 		current->small_allocations_can_fail++;
> > }
> > 
> > that way (or something similar) we can select the behaviour on a per-fs
> > basis and the rest of the kernel remains unaffected.  Other subsystems
> > can opt in as well.
> 
> This is basically leading to GFP_MAYFAIL which is completely backwards
> (the hard requirement should be an exception not a default rule).
> I really do not want to end up with stuffing random may_fail annotations
> all over the kernel.
> 

I wish that GFP_NOFS / GFP_NOIO regions are annotated with

  static inline void enter_fs_code(void)
  {
  #ifdef CONFIG_DEBUG_GFP_FLAGS
  	current->in_fs_code++;
  #endif
  }

  static inline void leave_fs_code(void)
  {
  #ifdef CONFIG_DEBUG_GFP_FLAGS
  	current->in_fs_code--;
  #endif
  }

  static inline void enter_io_code(void)
  {
  #ifdef CONFIG_DEBUG_GFP_FLAGS
  	current->in_io_code++;
  #endif
  }

  static inline void leave_io_code(void)
  {
  #ifdef CONFIG_DEBUG_GFP_FLAGS
  	current->in_io_code--;
  #endif
  }

so that inappropriate GFP_KERNEL usage inside GFP_NOFS region are catchable
by doing

  struct page *
  __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
                          struct zonelist *zonelist, nodemask_t *nodemask)
  {
  	struct zoneref *preferred_zoneref;
  	struct page *page = NULL;
  	unsigned int cpuset_mems_cookie;
  	int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_FAIR;
  	gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
  	struct alloc_context ac = {
  		.high_zoneidx = gfp_zone(gfp_mask),
  		.nodemask = nodemask,
  		.migratetype = gfpflags_to_migratetype(gfp_mask),
  	};
  	
  	gfp_mask &= gfp_allowed_mask;
 +#ifdef CONFIG_DEBUG_GFP_FLAGS
 +	WARN_ON(current->in_fs_code & (gfp_mask & __GFP_FS));
 +	WARN_ON(current->in_io_code & (gfp_mask & __GFP_IO));
 +#endif
  
  	lockdep_trace_alloc(gfp_mask);
  

. It is difficult for non-fs developers to determine whether they need to use
GFP_NOFS than GFP_KERNEL in their code. An example is seen at
http://marc.info/?l=linux-security-module&m=138556479607024&w=2 .

Moreover, I don't know how GFP flags are managed when stacked like
"a swap file on ext4 on top of LVM (with snapshots) on a RAID array
connected over iSCSI" (quoted from comments on Jon's writeup), but I
wish that the distinction between GFP_KERNEL / GFP_NOFS / GFP_NOIO
are removed from memory allocating function callers by doing

  static inline void enter_fs_code(void)
  {
  	current->in_fs_code++;
  }

  static inline void leave_fs_code(void)
  {
  	current->in_fs_code--;
  }

  static inline void enter_io_code(void)
  {
  	current->in_io_code++;
  }

  static inline void leave_io_code(void)
  {
  	current->in_io_code--;
  }

  struct page *
  __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
                          struct zonelist *zonelist, nodemask_t *nodemask)
  {
  	struct zoneref *preferred_zoneref;
  	struct page *page = NULL;
  	unsigned int cpuset_mems_cookie;
  	int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_FAIR;
  	gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
  	struct alloc_context ac = {
  		.high_zoneidx = gfp_zone(gfp_mask),
  		.nodemask = nodemask,
  		.migratetype = gfpflags_to_migratetype(gfp_mask),
  	};
  	
  	gfp_mask &= gfp_allowed_mask;
 +	if (current->in_fs_code)
 +		gfp_mask &= ~__GFP_FS;
 +	if (current->in_io_code)
 +		gfp_mask &= ~__GFP_IO;
  
  	lockdep_trace_alloc(gfp_mask);
  

so that GFP flags passed to memory allocations involved by stacking
will be appropriately masked.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2015-03-17 14:06 UTC|newest]

Thread overview: 63+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-03-11 20:54 [PATCH 0/2] Move away from non-failing small allocations Michal Hocko
2015-03-11 20:54 ` Michal Hocko
2015-03-11 20:54 ` Michal Hocko
2015-03-11 20:54 ` [PATCH 1/2] mm: Allow small allocations to fail Michal Hocko
2015-03-11 20:54   ` Michal Hocko
2015-03-11 20:54   ` Michal Hocko
2015-03-12 12:54   ` Tetsuo Handa
2015-03-12 12:54     ` Tetsuo Handa
2015-03-12 13:12     ` Michal Hocko
2015-03-12 13:12       ` Michal Hocko
2015-03-15  5:43   ` Tetsuo Handa
2015-03-15  5:43     ` Tetsuo Handa
2015-03-15 12:13     ` Michal Hocko
2015-03-15 12:13       ` Michal Hocko
2015-03-15 13:06       ` Tetsuo Handa
2015-03-15 13:06         ` Tetsuo Handa
2015-03-16  7:46         ` [PATCH 1/2 v2] " Michal Hocko
2015-03-16  7:46           ` Michal Hocko
2015-03-16 21:11           ` Johannes Weiner
2015-03-16 21:11             ` Johannes Weiner
2015-03-17 10:25             ` Michal Hocko
2015-03-17 10:25               ` Michal Hocko
2015-03-17 13:29               ` Johannes Weiner
2015-03-17 13:29                 ` Johannes Weiner
2015-03-17 14:17                 ` Michal Hocko
2015-03-17 14:17                   ` Michal Hocko
2015-03-17 17:26                   ` Johannes Weiner
2015-03-17 17:26                     ` Johannes Weiner
2015-03-17 19:41                     ` Michal Hocko
2015-03-17 19:41                       ` Michal Hocko
2015-03-18  9:10                       ` Vlastimil Babka
2015-03-18  9:10                         ` Vlastimil Babka
2015-03-18 12:04                         ` Michal Hocko
2015-03-18 12:04                           ` Michal Hocko
2015-03-18 12:36                         ` Tetsuo Handa
2015-03-18 12:36                           ` Tetsuo Handa
2015-03-18 11:35                       ` Tetsuo Handa
2015-03-18 11:35                         ` Tetsuo Handa
2015-03-17 11:13           ` Tetsuo Handa
2015-03-17 11:13             ` Tetsuo Handa
2015-03-17 13:15             ` Michal Hocko
2015-03-17 13:15               ` Michal Hocko
2015-03-18 11:33               ` Tetsuo Handa
2015-03-18 11:33                 ` Tetsuo Handa
2015-03-18 12:23                 ` Michal Hocko
2015-03-18 12:23                   ` Michal Hocko
2015-03-19 11:03                   ` Tetsuo Handa
2015-03-19 11:03                     ` Tetsuo Handa
2015-03-11 20:54 ` [PATCH 2/2] mmotm: Enable small allocation " Michal Hocko
2015-03-11 20:54   ` Michal Hocko
2015-03-11 20:54   ` Michal Hocko
2015-03-11 22:36 ` [PATCH 0/2] Move away from non-failing small allocations Sasha Levin
2015-03-11 22:36   ` Sasha Levin
2015-03-11 22:36   ` Sasha Levin
2015-03-16 22:38 ` Andrew Morton
2015-03-16 22:38   ` Andrew Morton
2015-03-16 22:38   ` Andrew Morton
2015-03-17  9:07   ` Michal Hocko
2015-03-17  9:07     ` Michal Hocko
2015-03-17 14:06     ` Tetsuo Handa [this message]
2015-03-17 14:06       ` Tetsuo Handa
2015-04-02 11:53       ` Tetsuo Handa
2015-04-02 11:53         ` Tetsuo Handa

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=201503172305.DIH52162.FOFMFOVJHLOtQS@I-love.SAKURA.ne.jp \
    --to=penguin-kernel@i-love.sakura.ne.jp \
    --cc=akpm@linux-foundation.org \
    --cc=david@fromorbit.com \
    --cc=fengguang.wu@intel.com \
    --cc=hannes@cmpxchg.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@suse.de \
    --cc=mhocko@suse.cz \
    --cc=riel@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.