* How to handle a kmalloc failure in evict_inode()?
@ 2014-08-04 23:41 Theodore Ts'o
  2014-08-05  1:13 ` Dave Chinner
  0 siblings, 1 reply; 8+ messages in thread
From: Theodore Ts'o @ 2014-08-04 23:41 UTC (permalink / raw)
  To: linux-fsdevel


I've been trying to figure out the best way to handle potential memory
allocation failures in evict_inode(), since there doesn't seem to be any
way to reflect errors back up to the caller, or to delay the inode
eviction until more memory might be available.
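
(For context, the interface itself gives us nothing to work with -- the
relevant super_operations method returns void:)

	/* from include/linux/fs.h, struct super_operations: */
	void (*evict_inode) (struct inode *);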

There doesn't seem to be a good solution; right now, in ext4, we mark
the file system as being inconsistent, thus triggering a remount
read-only or a panic, which seems to be.... sub-optimal.

I was looking to see what xfs does to handle this case, and as near as I
can tell, xfs_fs_evict_inode() calls xfs_inactive(), which in turn calls
xfs_free_eofblocks() --- and ignores the error return.  And when I look
to see whether xfs_free_eofblocks() can return ENOMEM, it appears that
it can, via a call path that involves xfs_bmapi_read(),
xfs_iread_extents(), xfs_bmap_read_extents(), xfs_btree_read_bufl(),
xfs_trans_read_buf(), xfs_trans_read_buf_map(), xfs_buf_read_map(),
xfs_buf_get_map(), _xfs_buf_alloc(), which can return ENOMEM.

As near as I can tell, at least for ext4, we've been remarkably lucky,
in that GFP_NOFS allocations seem to rarely if ever fail.  However,
under enough memory pressure, and in a situation where the OOM killer
has been configured to be less aggressive, it is possible (which is as
we would expect) for a kmalloc() to fail, and short of using
__GFP_NOFAIL, or using a retry loop, I'm not sure there's a good answer
to this problem.

Am I missing something obvious?  Has someone come up with a good way of
dealing with the need to release an inode and its blocks in
evict_inode(), and what to do if a memory allocation fails for some
reason?

Thanks,

                                                - Ted


* Re: How to handle a kmalloc failure in evict_inode()?
  2014-08-04 23:41 How to handle a kmalloc failure in evict_inode()? Theodore Ts'o
@ 2014-08-05  1:13 ` Dave Chinner
  2014-08-05  1:53   ` Theodore Ts'o
  0 siblings, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2014-08-05  1:13 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-fsdevel

On Mon, Aug 04, 2014 at 07:41:01PM -0400, Theodore Ts'o wrote:
> 
> I've been trying to figure out the best way to handle potential memory
> allocation failures in evict_inode(), since there doesn't seem to be any
> way to reflect errors back up to the caller, or to delay the inode
> eviction until more memory might be available.
> 
> There doesn't seem to be a good solution; right now, in ext4, we mark
> the file system as being inconsistent, thus triggering a remount
> read-only or a panic, which seems to be.... sub-optimal.
> 
> I was looking to see what xfs does to handle this case, and as near as I
> can tell, xfs_fs_evict_inode() calls xfs_inactive(), which in turn calls
> xfs_free_eofblocks() --- and ignores the error return.

Sure, because failing to free EOF blocks on referenced inodes (i.e. link
count > 0) is not a serious problem. i.e. freeing speculative preallocation
beyond EOF is a best effort operation so lock inversions or memory
allocation failure simply means we don't do it. It'll get cleaned up
in the future (i.e. next time the inode gets pulled into cache).

> And when I look
> to see whether xfs_free_eofblocks() can return ENOMEM, it appears that
> it can, via a call path that involves xfs_bmapi_read(),
> xfs_iread_extents(), xfs_bmap_read_extents(), xfs_btree_read_bufl(),
> xfs_trans_read_buf(), xfs_trans_read_buf_map(), xfs_buf_read_map(),
> xfs_buf_get_map(), _xfs_buf_alloc(), which can return ENOMEM.

Yes, _xfs_buf_alloc() can return ENOMEM, but did you actually look
at the memory allocation code? i.e. the kmem_*alloc(KM_NOFS) calls?
Put simply, KM_NOFS allocations *never fail* in XFS, unless
KM_NOSLEEP (i.e. GFP_ATOMIC) or KM_MAYFAIL are also specified. So
_xfs_buf_alloc() will never actually return ENOMEM at this point.


Historically speaking, the core XFS code has little in the way of
ENOMEM error handling because the OS it came from (Irix) guaranteed
that critical memory allocations would always succeed. The
kmem_*alloc() wrappers still implement those semantics today on
Linux for the core XFS code.
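
From memory, the wrapper looks roughly like this (paraphrased, so go
read fs/xfs/kmem.c for the real code):

void *
kmem_alloc(size_t size, xfs_km_flags_t flags)
{
	int	retries = 0;
	gfp_t	lflags = kmem_flags_convert(flags);
	void	*ptr;

	do {
		ptr = kmalloc(size, lflags);
		if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
			return ptr;
		if (!(++retries % 100))
			xfs_err(NULL,
		"possible memory allocation deadlock in %s (mode:0x%x)",
					__func__, lflags);
		congestion_wait(BLK_RW_ASYNC, HZ/50);
	} while (1);
}

i.e. unless the caller explicitly says it can handle failure, we just
keep retrying with a congestion backoff.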

We're slowly adding all the ENOMEM handling code we need through the
stack to handle ENOMEM sanely. However, until we have transaction
rollback code (i.e to cancel dirty transactions), we can't ever fail
memory allocations in transactions. Hence the error handling code is
slowly appearing, but the core allocator still doesn't allow failure
to occur....

> As near as I can tell, at least for ext4, we've been remarkably lucky,
> in that GFP_NOFS allocations seem to rarely if ever fail.

Oh, they fail quite frequently....

> However,
> under enough memory pressure, and in a situation where the OOM killer
> has been configured to be less aggressive, it is possible (which is as
> we would expect) for a kmalloc() to fail, and short of using
> __GFP_NOFAIL, or using a retry loop, I'm not sure there's a good answer
> to this problem.

Yup, that's why XFS uses __GFP_NOWARN and a retry loop - because
there are places where failure to allocate memory can have only
one result: denial of service via a filesystem shutdown.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: How to handle a kmalloc failure in evict_inode()?
  2014-08-05  1:13 ` Dave Chinner
@ 2014-08-05  1:53   ` Theodore Ts'o
  2014-08-05 12:17     ` Dave Chinner
  0 siblings, 1 reply; 8+ messages in thread
From: Theodore Ts'o @ 2014-08-05  1:53 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel

On Tue, Aug 05, 2014 at 11:13:56AM +1000, Dave Chinner wrote:
> Sure, because failing to free EOF blocks on referenced inodes (i.e. link
> count > 0) is not a serious problem. i.e. freeing speculative preallocation
> beyond EOF is a best effort operation so lock inversions or memory
> allocation failure simply means we don't do it. It'll get cleaned up
> in the future (i.e. next time the inode gets pulled into cache).

I was worried about the case where the link count == 0 and
evict_inode() needs to release the inode --- and you get a memory
allocation failure.

I didn't realize that xfs's kmem_*alloc() functions use a retry loop.
I think these days the preferred solution from the MM folks is to use
__GFP_NOFAIL (although for a long time folks like David Rientjes were
trying to claim that both __GFP_NOFAIL and looping were evil, and that
all kernel code had to be able to handle memory allocation failures).

As near as I can tell, in the case of evict_inode() and link_count == 0,
using either __GFP_NOFAIL or __GFP_NOWARN plus a retry loop is the only
way to deal with it, because in that case failure is simply not an
option.

> Yup, that's why XFS uses __GFP_NOWARN and a retry loop - because
> there are places where failure to allocate memory can have only
> one result: denial of service via a filesystem shutdown.

Great, the next time David Rientjes whines at me, I'll point him in
your direction, and we can flame him together.  :-)

							- Ted


* Re: How to handle a kmalloc failure in evict_inode()?
  2014-08-05  1:53   ` Theodore Ts'o
@ 2014-08-05 12:17     ` Dave Chinner
  2014-08-05 17:21       ` Theodore Ts'o
  0 siblings, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2014-08-05 12:17 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-fsdevel

On Mon, Aug 04, 2014 at 09:53:22PM -0400, Theodore Ts'o wrote:
> On Tue, Aug 05, 2014 at 11:13:56AM +1000, Dave Chinner wrote:
> > Sure, because failing to free EOF blocks on referenced inodes (i.e. link
> > count > 0) is not a serious problem. i.e. freeing speculative preallocation
> > beyond EOF is a best effort operation so lock inversions or memory
> > allocation failure simply means we don't do it. It'll get cleaned up
> > in the future (i.e. next time the inode gets pulled into cache).
> 
> I was worried about the case where the link count == 0 and
> evict_inode() needs to release the inode --- and you get a memory
> allocation failure.

Even with XFS, that's not a big deal. We need code to handle it, but
evict/destroy_inode doesn't actually free the inode in XFS - it simply
removes it from the visibility of the VFS and marks it as
reclaimable in internal structures.

Reclaiming the inode and freeing the memory is done by another
shrinker - the fs specific one attached to the superblock - and so
in XFS we have the /potential/ to "fail" to evict inodes for as
long as memory allocation fails because we can redo the operation
inside the filesystem....
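
In rough pseudo-code terms the structure is (paraphrasing - the real
code lives in fs/xfs/xfs_super.c and fs/xfs/xfs_icache.c):

STATIC void
xfs_fs_destroy_inode(
	struct inode		*inode)
{
	struct xfs_inode	*ip = XFS_I(inode);

	/*
	 * Nothing is freed here - just hand the inode to background
	 * reclaim, which can back off and retry later if memory
	 * allocation fails.
	 */
	xfs_inode_set_reclaim_tag(ip);
}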

IOWs, the longer term plan is to move all this stuff to async
workqueue processing and so be able to defer and batch unlink and
reclaim work more efficiently:

http://xfs.org/index.php/Improving_inode_Caching#Inode_Unlink
http://xfs.org/index.php/Improving_inode_Caching#Reclaim_Optimizations

> I didn't realize that xfs's kmem_*alloc() functions use a retry loop.
> I think these days the preferred solution from the MM folks is to use
> __GFP_NOFAIL (although for a long time folks like David Rientjes were
> trying to claim that both __GFP_NOFAIL and looping were evil, and that
> all kernel code had to be able to handle memory allocation failures).

I've always just ignored that "looping is evil" stupidity because
it's a necessity. Nobody has tried to remove it from XFS, so until
then ignorance is bliss....

> As near as I can tell, in the case of evict_inode() and link_count == 0,
> using either __GFP_NOFAIL or __GFP_NOWARN plus a retry loop is the only
> way to deal with it, because in that case failure is simply not an
> option.

*nod*

> > Yup, that's why XFS uses __GFP_NOWARN and a retry loop - because
> > there are places where failure to allocate memory can have only
> > one result: denial of service via a filesystem shutdown.
> 
> Great, the next time David Rientjes whines at me, I'll point him in
> your direction, and we can flame him together.  :-)

Fun for everyone. Popcorn? ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: How to handle a kmalloc failure in evict_inode()?
  2014-08-05 12:17     ` Dave Chinner
@ 2014-08-05 17:21       ` Theodore Ts'o
  2014-08-05 22:12         ` Dave Chinner
  0 siblings, 1 reply; 8+ messages in thread
From: Theodore Ts'o @ 2014-08-05 17:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel

On Tue, Aug 05, 2014 at 10:17:17PM +1000, Dave Chinner wrote:
> IOWs, the longer term plan is to move all this stuff to async
> workqueue processing and so be able to defer and batch unlink and
> reclaim work more efficiently:

> http://xfs.org/index.php/Improving_inode_Caching#Inode_Unlink

I discussed doing this for ext4 a while back (because on a very busy
machine, unlink latency can be quite large).  I got pushback because
people were concerned that if a very large directory is getting
deleted --- say, you're cleaning up the directory belonging to a
(for example) Docker / Borg / Omega job that has been shut down, so
the equivalent of an "rm -rf" of several hundred files comprising tens
or hundreds of megabytes or gigabytes --- the fact that all of the
unlinks have returned without the space being available could confuse
a number of programs.  And it's not just "df": if the user is over
quota, they still wouldn't be allowed to write for seconds or minutes,
because the block release isn't taking place except in a workqueue
that could potentially get deferred for a non-trivial amount of time.

I could imagine recruiting the process that tries to do a block
allocation that would otherwise have failed with ENOSPC or EDQUOT to
help complete the deallocation of inodes and release disk space, but
then we're moving the latency variability from the unlink() call to an
otherwise innocent production job that is trying to do file writes.  So
the user visibility is more than just the df statistics; it's also some
file writes either failing or suffering increased latency until the
blocks can be reclaimed.

Have the XFS developers considered these sorts of concerns, and are
there any solutions to these issues that you've contemplated?

Cheers,

						- Ted


* Re: How to handle a kmalloc failure in evict_inode()?
  2014-08-05 17:21       ` Theodore Ts'o
@ 2014-08-05 22:12         ` Dave Chinner
  2014-08-09 10:48           ` Tetsuo Handa
  0 siblings, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2014-08-05 22:12 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-fsdevel

On Tue, Aug 05, 2014 at 01:21:23PM -0400, Theodore Ts'o wrote:
> On Tue, Aug 05, 2014 at 10:17:17PM +1000, Dave Chinner wrote:
> > IOWs, the longer term plan is to move all this stuff to async
> > workqueue processing and so be able to defer and batch unlink and
> > reclaim work more efficiently:
> 
> > http://xfs.org/index.php/Improving_inode_Caching#Inode_Unlink
> 
> I discussed doing this for ext4 a while back (because on a very busy
> machine, unlink latency can be quite large).  I got pushback because
> people were concerned that if a very large directory is getting
> deleted --- say, you're cleaning up the directory belonging to a
> (for example) Docker / Borg / Omega job that has been shut down, so
> the equivalent of an "rm -rf" of several hundred files comprising tens
> or hundreds of megabytes or gigabytes --- the fact that all of the
> unlinks have returned without the space being available could confuse
> a number of programs.  And it's not just "df": if the user is over
> quota, they still wouldn't be allowed to write for seconds or minutes,
> because the block release isn't taking place except in a workqueue
> that could potentially get deferred for a non-trivial amount of time.

I'm not concerned about space usage in general - XFS already does
things that cause "unexpected" space usage issues (e.g. dynamic
speculative preallocation) and so we've demonstrated how to deal
with such issues. That is, make the radical change of behaviour and
then temper the new behaviour such that it doesn't affect users and
applications adversely whilst still maintaining the benefits the
change was intended to provide.

The reality is that nobody really even notices dynamic speculative
prealloc anymore because we've refined it to only have short-term
impact on space usage and have triggers to reduce, turn off and/or
reclaim speculative prealloc if free space is low or the workload is
adversely affected by it. The short-term differences in df, du and
space usage just don't matter....

Background unlink is no different. If large amounts of space freeing
are deferred, we can kick the queue to run sooner than it's default
period.  We can account space in deferred inactivations as "delayed
freeing" for statfs() and so hide such behaviour from userspace
completely.  If the user hits edquot, we can run a scan
to truncate any inodes accounted to that quota id that are in reclaim
state. Same for ENOSPC.
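
e.g. the statfs() side could be as simple as something like this
(purely hypothetical - the counter doesn't exist today):

	/*
	 * in xfs_fs_statfs(): blocks owned by inodes queued for
	 * inactivation are as good as free from userspace's point of
	 * view, so report them that way.
	 */
	statp->f_bfree += percpu_counter_read_positive(&mp->m_delayed_frees);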

> I could imagine recruiting the process that tries to do a block
> allocation that would otherwise have failed with ENOSPC or EDQUOT to
> help complete the deallocation of inodes and release disk space, but
> then we're moving the latency variability from the unlink() call to an
> otherwise innocent production job that is trying to do file writes.  So
> the user visibility is more than just the df statistics; it's also some
> file writes either failing or suffering increased latency until the
> blocks can be reclaimed.

Sure, but we already do that to free up delayed allocation metadata
reservations (i.e. run fs-wide writeback) and to free speculative
preallocation (eofblocks scan) before we fail the write with
ENOSPC/EDQUOT. It's a rare slow path, already has extremely variable
(and long!) latencies, so the overhead of adding more inode/space
reclaim work does not really change anything fundamental.
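
For reference, the buffered write path already does roughly this on
ENOSPC (paraphrasing from memory - see xfs_file_buffered_aio_write()
for the real thing):

	/*
	 * If we hit ENOSPC, flush all the dirty delalloc data to turn
	 * reserved metadata space back into free space, then retry the
	 * write once before giving up.
	 */
	if (ret == -ENOSPC && !enospc) {
		enospc = 1;
		xfs_flush_inodes(ip->i_mount);
		ret = 0;
		goto write_retry;
	}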

There's a simple principle of system design: if a latency-sensitive
application is running anywhere near slow paths, then the design is
fundamentally wrong. i.e. if someone chooses to run a
latency-sensitive app at EDQUOT/ENOSPC then that's not a problem we
can solve at the XFS design level, and as such we've never tried to
solve it. Indeed, best practices say you shouldn't run such
applications on XFS filesystems more than 85% full.....

> Have the XFS developers considered these sorts of concerns, and are
> there any solutions to these issues that you've contemplated?

I think it is obvious we've been walking this path for quite some
time now. ;)

The fundamental observation is that the vast majority of XFS
filesystem operations occur when there is ample free space.
Trading off increased slowpath latency variation for increased
fast-path bulk throughput rates and improved resilience against
filesystem aging artifacts is exactly the right thing to be doing
for the vast majority of XFS users. We have always done this in XFS,
especially w.r.t. improving scalability.

Fact is, I'm quite happy to be flamed by people who think that such
behavioural changes are the work of the $ANTIDEITY. If we are not
making some people unhappy, then we are not pushing the boundaries
of what is possible enough. The key is to listen to why people are
unhappy about the change and to then address those concerns without
compromising the benefits of the original change.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: How to handle a kmalloc failure in evict_inode()?
  2014-08-05 22:12         ` Dave Chinner
@ 2014-08-09 10:48           ` Tetsuo Handa
  2014-08-09 15:43             ` Theodore Ts'o
  0 siblings, 1 reply; 8+ messages in thread
From: Tetsuo Handa @ 2014-08-09 10:48 UTC (permalink / raw)
  To: david, tytso; +Cc: linux-fsdevel

It seems to me that you are talking about only cases where memory allocation
functions return NULL instead of waiting forever. But that is an optimistic
assumption. What I'm observing under extreme memory pressure with XFS
( https://lkml.org/lkml/2014/7/2/249 ) is that memory allocation functions spin
forever while keeping some kernel threads locked. Do you have ideas for
solving (or at least detecting) this behavior?


* Re: How to handle a kmalloc failure in evict_inode()?
  2014-08-09 10:48           ` Tetsuo Handa
@ 2014-08-09 15:43             ` Theodore Ts'o
  0 siblings, 0 replies; 8+ messages in thread
From: Theodore Ts'o @ 2014-08-09 15:43 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: david, linux-fsdevel

On Sat, Aug 09, 2014 at 07:48:26PM +0900, Tetsuo Handa wrote:
> It seems to me that you are talking about only cases where memory allocation
> functions return NULL instead of waiting forever. But that is an optimistic
> assumption. What I'm observing under extreme memory pressure with XFS
> ( https://lkml.org/lkml/2014/7/2/249 ) is that memory allocation functions spin
> forever while keeping some kernel threads locked. Do you have ideas for
> solving (or at least detecting) this behavior?

This is an mm problem.  I don't doubt that there are probably some
things the mm could do a better job at, but under a malicious /
abusive enough workload, the only solution is to OOM kill one or more
offending processes.

The mm layer can try to be more efficient at packing a certain amount
of sh*t into a bag, but ultimately, if you have 10 pounds of sh*t, and
a 5 pound bag, the bag is going to burst.  The only question is how
the bag is going to burst, and does so via just killing processes, or
whether you get some file system corruption in addition, as a bonus.

This is why I suspect the right answer is for the file system to use
__GFP_NOFAIL instead of doing the retry loop inside the file system.
That gives the mm layer an easier way of getting a signal that things
are going bad, since the looping is done inside the slab allocator,
and so the mm can use this signal to decide if the time is right to
kick up the aggressiveness of the OOM killer.
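
To spell it out (sketch only, not a patch against any particular
filesystem):

	/* today: a filesystem-private loop the mm can't see into */
	do {
		ptr = kmalloc(size, GFP_NOFS | __GFP_NOWARN);
	} while (!ptr);

	/* instead: tell the allocator the real constraint */
	ptr = kmalloc(size, GFP_NOFS | __GFP_NOFAIL);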

     	    		      	      - Ted





