* (unknown)
From: Nagachandra P @ 2013-06-25  9:25 UTC
  To: Vikram MP; +Cc: linux-ext4

Hi,

Here are some details on the platform

Linux kernel version - 3.4.5
Android - 4.2.2
ext4 mounted with *errors=panic* option.

We see memory allocation failures, mostly caused by the low memory
killer killing the process that is doing ext4 work while it waits for
an allocation on the slow path. (Below is one such instance.)

select 26413 (AsyncTask #3), score_adj 647, adj 10,size 15287, to kill
send sigkill to 26413 (AsyncTask #3), score_adj 647,adj 10, size 15287
with ofree -450 10896, cfree 27845 984 msa 529 ma 8
AsyncTask #3: page allocation failure: order:0, mode:0x80050
[<c001595c>] (unwind_backtrace+0x0/0x11c) from [<c00dc064>]
(warn_alloc_failed+0xe8/0x110)
[<c00dc064>] (warn_alloc_failed+0xe8/0x110) from [<c00dee54>]
(__alloc_pages_nodemask+0x6d4/0x800)
[<c00dee54>] (__alloc_pages_nodemask+0x6d4/0x800) from [<c05fe250>]
(cache_alloc_refill+0x30c/0x6a4)
[<c05fe250>] (cache_alloc_refill+0x30c/0x6a4) from [<c0104eb4>]
(kmem_cache_alloc+0xa0/0x1b8)
[<c0104eb4>] (kmem_cache_alloc+0xa0/0x1b8) from [<c0192c34>]
(ext4_free_blocks+0x9c4/0xa08)
[<c0192c34>] (ext4_free_blocks+0x9c4/0xa08) from [<c01873bc>]
(ext4_ext_remove_space+0x690/0xd9c)
[<c01873bc>] (ext4_ext_remove_space+0x690/0xd9c) from [<c01897f8>]
(ext4_ext_truncate+0x100/0x1c8)
[<c01897f8>] (ext4_ext_truncate+0x100/0x1c8) from [<c016447c>]
(ext4_truncate+0xf4/0x194)
[<c016447c>] (ext4_truncate+0xf4/0x194) from [<c0166cc0>]
(ext4_setattr+0x36c/0x3f8)
[<c0166cc0>] (ext4_setattr+0x36c/0x3f8) from [<c011f540>]
(notify_change+0x1dc/0x2a8)
[<c011f540>] (notify_change+0x1dc/0x2a8) from [<c0107cc8>]
(do_truncate+0x74/0x90)
[<c0107cc8>] (do_truncate+0x74/0x90) from [<c0107e20>]
(do_sys_ftruncate+0x13c/0x144)
[<c0107e20>] (do_sys_ftruncate+0x13c/0x144) from [<c0108020>]
(sys_ftruncate+0x18/0x1c)
[<c0108020>] (sys_ftruncate+0x18/0x1c) from [<c000e140>]
(ret_fast_syscall+0x0/0x48)

followed by....

SLAB: Unable to allocate memory on node 0 (gfp=0x80050)
  cache: ext4_free_data, object size: 64, order: 0
  node 0: slabs: 0/0, objs: 0/0, free: 0
EXT4-fs error (device mmcblk0p26) in ext4_free_blocks:4700: Out of memory
Aborting journal on device mmcblk0p26-8.
EXT4-fs error (device mmcblk0p26): ext4_journal_start_sb:328: Detected
aborted journal
EXT4-fs (mmcblk0p26): Remounting filesystem read-only
Kernel panic - not syncing: EXT4-fs panic from previous error

[<c001595c>] (unwind_backtrace+0x0/0x11c) from [<c05fc5b4>] (panic+0x80/0x1cc)
[<c05fc5b4>] (panic+0x80/0x1cc) from [<c017ddec>] (__ext4_abort+0xc0/0xe0)
[<c017ddec>] (__ext4_abort+0xc0/0xe0) from [<c017dfa0>]
(ext4_journal_start_sb+0x194/0x1c4)
[<c017dfa0>] (ext4_journal_start_sb+0x194/0x1c4) from [<c0168c60>]
(ext4_dirty_inode+0x14/0x40)
[<c0168c60>] (ext4_dirty_inode+0x14/0x40) from [<c01293c0>]
(__mark_inode_dirty+0x2c/0x1b4)
[<c01293c0>] (__mark_inode_dirty+0x2c/0x1b4) from [<c011d6b8>]
(file_update_time+0xfc/0x11c)
[<c011d6b8>] (file_update_time+0xfc/0x11c) from [<c00d8f34>]
(__generic_file_aio_write+0x2d8/0x40c)
[<c00d8f34>] (__generic_file_aio_write+0x2d8/0x40c) from [<c00d90c8>]
(generic_file_aio_write+0x60/0xc8)
[<c00d90c8>] (generic_file_aio_write+0x60/0xc8) from [<c015f74c>]
(ext4_file_write+0x244/0x2b4)
[<c015f74c>] (ext4_file_write+0x244/0x2b4) from [<c0108a38>]
(do_sync_write+0x9c/0xd8)
[<c0108a38>] (do_sync_write+0x9c/0xd8) from [<c0109304>] (vfs_write+0xb0/0x128)
[<c0109304>] (vfs_write+0xb0/0x128) from [<c010953c>] (sys_write+0x38/0x64)
[<c010953c>] (sys_write+0x38/0x64) from [<c000e140>] (ret_fast_syscall+0x0/0x48)

Is there a way we could avoid the ext4 panic caused by an allocation
failure (a method other than setting errors=continue :-) )? Or is a
memory allocation failure considered as fatal as any other I/O error?

Thanks
Nagachandra

* Re: Memory allocation can cause ext4 filesystem to be remounted r/o
From: Theodore Ts'o @ 2013-06-26 14:02 UTC
  To: Nagachandra P; +Cc: Vikram MP, linux-ext4

On Tue, Jun 25, 2013 at 02:55:33PM +0530, Nagachandra P wrote:
> 
> Here are some details on the platform
> 
> Linux kernel version - 3.4.5
> Android - 4.2.2
> ext4 mounted with *errors=panic* option.
> 
> We see memory allocation failures, mostly caused by the low memory
> killer killing the process that is doing ext4 work while it waits for
> an allocation on the slow path. (Below is one such instance.)
>
> Is there a way we could avoid the ext4 panic caused by an allocation
> failure (a method other than setting errors=continue :-) )? Or is a
> memory allocation failure considered as fatal as any other I/O error?

In this particular case, we could reflect the error all the way up to
the ftruncate(2) system call.  Fixing this is going to be a bit
involved, unfortunately; we'll need to update a fairly large number of
function signatures, including ext4_truncate(), ext4_ext_truncate(),
ext4_free_blocks(), and a number of others.

One of the problems is that there are code paths where this doesn't
work.  For example, if a file descriptor was holding the inode open at
the time it was unlinked, we can only delete the file (which involves
a call to ext4_truncate()) in ext4_evict_inode(), and there isn't a
good error recovery path in that case.

Probably the best short-term fix for now is to add a flag used by
ext4_free_blocks() which retries the memory allocation in a loop (see
the retry_alloc loop in jbd2_journal_write_metadata_buffer() in
fs/jbd2/journal.c) and then initially add this flag to all of the
callers of ext4_free_blocks().
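
Roughly something like this (just a sketch -- the flag name
EXT4_FREE_BLOCKS_NOFAIL_ALLOC is made up, the cache pointer name is
from memory, and returning -ENOMEM assumes the signature changes
mentioned above):

	struct ext4_free_data *new_entry;

	do {
		new_entry = kmem_cache_alloc(ext4_free_data_cachep, GFP_NOFS);
		if (new_entry)
			break;
		if (!(flags & EXT4_FREE_BLOCKS_NOFAIL_ALLOC))
			return -ENOMEM;	/* caller can handle the failure */
		/* back off and let the OOM killer free something up */
		congestion_wait(BLK_RW_ASYNC, HZ/50);
	} while (1);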

We'll then need to fix the various callers where we can reflect the
error back to userspace to do so, and then drop the flag.  In the case
of ext4_evict_inode(), what we can do is perform the ext4_truncate()
inode truncation in the unlink() system call if there are no other
file descriptors keeping the inode from being deleted immediately.

	    	    	      	   	 - Ted

* Re: Memory allocation can cause ext4 filesystem to be remounted r/o
From: Theodore Ts'o @ 2013-06-26 14:54 UTC
  To: Nagachandra P; +Cc: Vikram MP, linux-ext4

On Wed, Jun 26, 2013 at 10:02:05AM -0400, Theodore Ts'o wrote:
> 
> In this particular case, we could reflect the error all the way up to
> the ftruncate(2) system call.  Fixing this is going to be a bit
> involved, unfortunately; we'll need to update a fairly large number of
> function signatures, including ext4_truncate(), ext4_ext_truncate(),
> ext4_free_blocks(), and a number of others.

One thing that comes to mind.  If we change things so that ftruncate
reflects an ENOMEM error all the way up to userspace, one side effect
of this is that the file may be partially truncated when ENOMEM is
returned.  Applications may not be prepared for this.
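
For example, an application that today assumes ftruncate() either
fully succeeds or leaves the file untouched would need to grow
handling along these lines (illustrative userspace code, not taken
from any real application):

	#include <errno.h>
	#include <stdio.h>
	#include <sys/types.h>
	#include <unistd.h>

	/* hypothetical helper: shrink a file, coping with a partial truncate */
	static int shrink_file(int fd, off_t new_size)
	{
		if (ftruncate(fd, new_size) == 0)
			return 0;
		if (errno == ENOMEM) {
			/* the file may already be partially truncated here;
			 * the caller has to fstat() and decide whether to
			 * retry or give up */
			perror("ftruncate");
		}
		return -1;
	}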

There would be a similar issue if we do the truncate in the unlink
call and return ENOMEM in case of a failure: the file might not be
unlinked, and in fact we might have a partially truncated file in the
directory, which would probably cause all sorts of confusion.  So
we're probably better off putting the inode on a list of inodes in
memory, and on the orphan list on disk, and then retrying the
truncation when memory is available.  Messy, but that probably gives
the best result for applications living constantly in high memory
pressure environments.
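
The shape of that would be something like the following (hypothetical
sketch; none of these fields or names exist in ext4 today):

	/* in struct ext4_sb_info: inodes whose truncate hit ENOMEM */
	struct list_head	s_deferred_truncate;
	spinlock_t		s_deferred_lock;

	/*
	 * On ENOMEM: leave the inode on the on-disk orphan list, add it
	 * to s_deferred_truncate, and queue a work item that retries
	 * ext4_truncate() later, once memory is (hopefully) available.
	 */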

							- Ted

* Re: Memory allocation can cause ext4 filesystem to be remounted r/o
From: Nagachandra P @ 2013-06-26 15:20 UTC
  To: Theodore Ts'o; +Cc: Vikram MP, linux-ext4

Thanks Theodore,

We have also seen cases where the current allocation itself causes the
lowmem shrinker to be called, which in turn chooses the same process
for killing because of the oom_adj value of the current process
(oom_adj is a weighting value associated with each process, based on
which the Android low memory killer selects a process to kill to
reclaim memory). If we chose to retry in such a case we could end up
in an endless loop of retrying the allocation. It would be better to
handle this without retrying.

We could use your above suggestion, which would address this specific
path.  But there are quite a number of allocations in ext4 which could
call ext4_std_error() on failure, and we may need to look at each one
of them to see how to handle it. Do you think this is something that
could be done?

We have in the past tried some ugly hacks to work around the problem
(adjusting oom_adj values to guard these processes from being killed),
but they don't seem to provide a foolproof mechanism in a high memory
pressure environment. Any advice on what we could try to fix the issue
in general would be appreciated.

Thanks again.

Best regards
Nagachandra


* Re: Memory allocation can cause ext4 filesystem to be remounted r/o
From: Theodore Ts'o @ 2013-06-26 16:34 UTC
  To: Nagachandra P; +Cc: Vikram MP, linux-ext4

On Wed, Jun 26, 2013 at 08:50:50PM +0530, Nagachandra P wrote:
> 
> We have also seen cases where the current allocation itself causes the
> lowmem shrinker to be called, which in turn chooses the same process
> for killing because of the oom_adj value of the current process
> (oom_adj is a weighting value associated with each process, based on
> which the Android low memory killer selects a process to kill to
> reclaim memory). If we chose to retry in such a case we could end up
> in an endless loop of retrying the allocation. It would be better to
> handle this without retrying.

The challenge is that in some cases there's no good way to return an
error back upwards, and in other cases, backing out of the middle of a
file system operation is incredibly hard.  This is why we have the
retry loop in the jbd2 code; the presumption is that some other
process is schedulable, which gives the OOM killer a chance to take
out other processes and free up memory while we wait.

It's not an ideal solution, but in practice it's been good enough.  In
general the OOM killer will be able to take out some other process and
free up memory that way.

Are you seeing this a lot?  If so, I think it's fair to ask why; from
what I can tell it's not a situation that is happening often on most
systems using ext4 (including Android devices, of which I have
several).

> We could use your above suggestion, which would address this specific
> path.  But there are quite a number of allocations in ext4 which could
> call ext4_std_error() on failure, and we may need to look at each one
> of them to see how to handle it. Do you think this is something that
> could be done?

There aren't that many places where ext4 does memory allocations,
actually.  And once you exclude those which are used when the file
system is initially mounted, there is quite a manageable number.  It's
probably better to audit all of those and to make sure we have a good
error recovery if any of these calls to kmalloc() or
kmem_cache_alloc() fail.

In many of the cases where we end up calling ext4_std_error(), the
most common cause is an I/O error while trying to read some critical
metadata block, and in that case, declaring that the file system is
corrupted is in fact the appropriate thing to do.

> We have in the past tried some ugly hacks to work around the problem
> (adjusting oom_adj values to guard these processes from being killed),
> but they don't seem to provide a foolproof mechanism in a high memory
> pressure environment. Any advice on what we could try to fix the issue
> in general would be appreciated.

What version of the kernel are you using?  And do you understand why you
are under so much memory pressure?  Is it due to applications not
getting killed quickly enough?  Are applications dirtying too much
memory too quickly?  Is write throttling not working?  Or are they
allocating too much memory when they start up their JVM?  Or is it
just that your Android device has far less memory than most of the
other devices out there?

Speaking generally, if you're regularly seeing kmem_cache_alloc()
failing, that means free memory has fallen to zero.  Which to me
sounds like the OOM killer should be trying to kill processes more
aggressively, and more generally you should be trying to make sure
the kernel is maintaining a somewhat larger amount of free memory.
The fact that you mentioned trying to prevent certain processes from
being killed may mean that you are approaching this problem from the
wrong direction.  It may be more fruitful to encourage the system to
kill those user applications that are most deserving _earlier_.

Regards,

					- Ted

* Re: Memory allocation can cause ext4 filesystem to be remounted r/o
From: Nagachandra P @ 2013-06-26 17:05 UTC
  To: Theodore Ts'o; +Cc: Vikram MP, linux-ext4

Hi Theodore,

Kernel version we are using is 3.4.5 (AOSP based).

These issues are not easy to reproduce!  We run multiple applications
(of different memory sizes) over a period of 24 to 36 hours and hit
this once.  These issues are typically easier to reproduce with around
512MB of memory (maybe in about 16 - 20 hrs), and harder to reproduce
with 1GB of memory.

Most of the time we get into this situation when an application
(typically an AsyncTask in Android) doing ext4 fs ops has a
low-priority adj value (> 9, typically 10 - 12) and hence is very
likely to be killed (and there is no way to distinguish this from the
application's perspective); this is one of the challenges we are
facing.  Also, here we don't have to be completely out of memory, just
within the LMK band for the process adj value.

But, on rethinking, your idea of retrying may work if we have some
tweaks in the LMK as well (like killing multiple tasks instead of
just one).

Thanks
Naga


* Re: Memory allocation can cause ext4 filesystem to be remounted r/o
From: Theodore Ts'o @ 2013-06-26 18:03 UTC
  To: Nagachandra P; +Cc: Vikram MP, linux-ext4

On Wed, Jun 26, 2013 at 10:35:22PM +0530, Nagachandra P wrote:
> 
> These issues are not easy to reproduce!  We run multiple applications
> (of different memory sizes) over a period of 24 to 36 hours and hit
> this once.  These issues are typically easier to reproduce with around
> 512MB of memory (maybe in about 16 - 20 hrs), and harder to reproduce
> with 1GB of memory.
> 
> Most of the time we get into this situation when an application
> (typically an AsyncTask in Android) doing ext4 fs ops has a
> low-priority adj value (> 9, typically 10 - 12) and hence is very
> likely to be killed (and there is no way to distinguish this from the
> application's perspective); this is one of the challenges we are
> facing.  Also, here we don't have to be completely out of memory, just
> within the LMK band for the process adj value.

To be clear, if the application is killed by the low memory killer,
we're not going to trigger the ext4_std_err() codepath.  The
ext4_std_error() is getting called because free memory has fallen to
_zero_ and so kmem_cache_alloc() returns an error.  Should ext4 do a
better job with handling this?  Yes, absolutely.  I do consider this a
fs bug that we should try to fix.  The reality though is if that free
memory has gone to zero, it's going to put multiple kernel subsystems
under stress.

It is good to hear that this is only happening on highly memory
constrained devices --- speaking as an owner of a Nexus 4 with 2GB of
memory.  :-P

That's why the bigger issue is why did free memory go to zero in the
first place?  That means the LMK was probably not being aggressive
enough, or something started consuming a lot of memory too quickly,
before the page cleaner and write throttling algorithms could kick in
and try to deal with it.

> But, on rethinking, your idea of retrying may work if we have some
> tweaks in the LMK as well (like killing multiple tasks instead of
> just one).

You might also consider looking at tweaking the mm low watermark and
minimum watermark.  See the tunable /proc/sys/vm/min_free_kbytes.

You might want to simply try monitoring the free memory levels on a
continuous basis, and see how often they drop below some minimum
level.  This will give you a figure of merit by which you can try
tuning your system, without needing to wait for a file system error.
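
Even something as simple as the following would do (userspace sketch;
the threshold and polling interval are arbitrary examples):

	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		const long min_free_kb = 8192;	/* example threshold */

		for (;;) {
			FILE *f = fopen("/proc/meminfo", "r");
			char line[128];
			long memfree_kb = -1;

			if (!f)
				return 1;
			while (fgets(line, sizeof(line), f))
				if (sscanf(line, "MemFree: %ld kB", &memfree_kb) == 1)
					break;
			fclose(f);
			if (memfree_kb >= 0 && memfree_kb < min_free_kb)
				fprintf(stderr, "MemFree low: %ld kB\n", memfree_kb);
			sleep(1);
		}
	}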

Cheers,

					- Ted

* Re: Memory allocation can cause ext4 filesystem to be remounted r/o
From: Joseph D. Wagner @ 2013-06-26 18:53 UTC
  To: Theodore Ts'o; +Cc: Nagachandra P, Vikram MP, linux-ext4

On 06/26/2013 7:54 am, Theodore Ts'o wrote:

> On Wed, Jun 26, 2013 at 10:02:05AM -0400, Theodore Ts'o wrote:
> 
>> In this particular case, we could reflect the error all the way up to
>> the ftruncate(2) system call. Fixing this is going to be a bit
>> involved, unfortunately; we'll need to update a fairly large number of
>> function signatures, including ext4_truncate(), ext4_ext_truncate(),
>> ext4_free_blocks(), and a number of others.
> 
> One thing that comes to mind. If we change things so that ftruncate
> reflects an ENOMEM error all the way up to userspace, one side effect
> of this is that the file may be partially truncated when ENOMEM is
> returned. Applications may not be prepared for this.

Hi Ted, it's the newbie again.  I'd like to throw out a possible
band-aid, which I know is ugly, but I'm not sure how it compares to
other ideas discussed.

What if there was a check at the start of the chain for free memory?
For example:

  1. User program calls function_x(parameter y).
  2. We know function_x() calls function_a(), function_b(), and
     function_c().
  3. Based upon our knowledge of those functions (and perhaps
     parameter y), we can _estimate_ that function_x() will require
     z bytes memory.
  4. Alter function_x() so that the first step is to check for free
     memory z.
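
A rough sketch of what step 4 might look like (estimate_needed_pages()
is made up; the point is only to show where the check would sit):

	/* hypothetical pre-check at the top of function_x() */
	unsigned long needed = estimate_needed_pages(parameter_y);

	if (global_page_state(NR_FREE_PAGES) < needed)
		return -ENOMEM;	/* fail early, before touching any metadata */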

Upside
  1. Obvious memory shortages are returned immediately, instead of 30
     steps down the chain.
  2. No risk of non-deterministic data changes (if caught; see
     downside).
  3. No risk of infinite loop due to retries.
  4. Puts a spotlight on applications that do not correctly handle
     ENOMEM, which to me is the equivalent of not correctly calling
     fsync().

Downside
  1. Does not guarantee that memory will be available when ext4 needs
     it.  Memory might be available during this pre-check, but another
     process might scoop it up between the pre-check and ext4's
     allocation.
  2. Does not catch all cases.  The check is only an estimate.

Thank you for your patience and for answering my questions.

Joseph D. Wagner

* Re: Memory allocation can cause ext4 filesystem to be remounted r/o
From: Theodore Ts'o @ 2013-06-26 22:14 UTC
  To: Joseph D. Wagner; +Cc: Nagachandra P, Vikram MP, linux-ext4

On Wed, Jun 26, 2013 at 11:53:12AM -0700, Joseph D. Wagner wrote:
>  1. Does not guarantee that memory will be available when ext4 needs
>     it.  Memory might be available during this pre-check, but another
>     process might scoop it up between the pre-check and ext4's
>     allocation.

This is the huge one.  In some cases we might need to read potentially
one or more disk blocks in the course of servicing the request.  By
the time we've read in the disk blocks, hundreds of milliseconds can
have gone by.  If there is a high speed transfer coming in over the
network, the networking stack can chew up a huge amount of
memory surprisingly quickly (to say nothing of a JVM that might be
starting up in parallel with reading in the allocation bitmaps, for
example).

This is also assuming we know in advance how much memory would be
needed.  Depending on how fragmented a file might be, the amount of
memory required can vary significantly.  And if the required metadata
blocks aren't already in memory, we won't know how much memory will be
needed until we pull the necessary blocks into memory.

The other problem with doing this is that the point at which we would
do the check for the necessary memory is at the high-level portions of
ext4, while the places where we are doing the memory allocation are
sometimes deep in the guts of ext4.  So figuring this out would
require some truly nasty abstraction violations.  For the same reason,
we can't simply allocate the memory before we start the file system
operation.

There are places where we could do this, without doing severe violence
to the surrounding code and without making things a total maintenance
nightmare.  But it's one of those things where we'd have to look at
each place where we allocate memory and decide what's the best way to
handle things.

							- Ted

* Re: Memory allocation can cause ext4 filesystem to be remounted r/o
From: Nagachandra P @ 2013-06-27 12:58 UTC
  To: Theodore Ts'o; +Cc: Vikram MP, linux-ext4

Hi Theodore,

Could you point me to the code where ext4_std_error() is not triggered
because of the LMK? As I see it, if a memory allocation returns an
error, in some of the cases ext4_std_error() would invariably be
called. Please consider the following call stack:

send sigkill to 5648 (id.app.sbrowser), score_adj 1000,adj 15, size
13257 with ofree -2010 20287, cfree 18597 902 msa 1000 ma 15
id.app.sbrowser: page allocation failure: order:0, mode:0x50
[<c0013aa8>] (unwind_backtrace+0x0/0x11c) from [<c00d6530>]
(warn_alloc_failed+0xe8/0x110)
[<c00d6530>] (warn_alloc_failed+0xe8/0x110) from [<c00d9308>]
(__alloc_pages_nodemask+0x6d4/0x804)
[<c00d9308>] (__alloc_pages_nodemask+0x6d4/0x804) from [<c00d2b34>]
(find_or_create_page+0x40/0x84)
[<c00d2b34>] (find_or_create_page+0x40/0x84) from [<c0188858>]
(ext4_mb_load_buddy+0xd4/0x2b4)
[<c0188858>] (ext4_mb_load_buddy+0xd4/0x2b4) from [<c018c69c>]
(ext4_free_blocks+0x5d4/0xa08)
[<c018c69c>] (ext4_free_blocks+0x5d4/0xa08) from [<c0181218>]
(ext4_ext_remove_space+0x690/0xd9c)
[<c0181218>] (ext4_ext_remove_space+0x690/0xd9c) from [<c0183654>]
(ext4_ext_truncate+0x100/0x1c8)
[<c0183654>] (ext4_ext_truncate+0x100/0x1c8) from [<c015e2ec>]
(ext4_truncate+0xf4/0x194)
[<c015e2ec>] (ext4_truncate+0xf4/0x194) from [<c01629dc>]
(ext4_evict_inode+0x3b4/0x4ac)
[<c01629dc>] (ext4_evict_inode+0x3b4/0x4ac) from [<c011871c>] (evict+0x8c/0x150)
[<c011871c>] (evict+0x8c/0x150) from [<c010f030>] (do_unlinkat+0xdc/0x134)
[<c010f030>] (do_unlinkat+0xdc/0x134) from [<c000e100>]
(ret_fast_syscall+0x0/0x30)

The failure to allocate memory in the above case is because of the
kill signal that was received.

__alloc_pages_slowpath would return NULL in case it has received a
KILL signal. (I don't see any code in 3.4.5 that would check for
something similar to TIF_MEMDIE to make a decision on whether to call
ext4_std_error() or not; has this been added recently?)
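
The bail-out I am referring to in the slow path is roughly this
(paraphrased from memory, not quoted from the 3.4.5 sources):

	/* in __alloc_pages_slowpath(): a task already picked by the
	 * OOM/LMK killer has TIF_MEMDIE set and is allowed to fail the
	 * allocation rather than loop */
	if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
		goto nopage;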

Thanks
Naga


* Re: Memory allocation can cause ext4 filesystem to be remounted r/o
From: Theodore Ts'o @ 2013-06-27 17:36 UTC
  To: Nagachandra P; +Cc: Vikram MP, linux-ext4

On Thu, Jun 27, 2013 at 06:28:21PM +0530, Nagachandra P wrote:
> Hi Theodore,
> 
> Could you point me to the code where ext4_std_error() is not triggered
> because of the LMK? As I see it, if a memory allocation returns an
> error, in some of the cases ext4_std_error() would invariably be
> called. Please consider the following call stack:

Yes, that's one example where a memory allocation failure can lead to
ext4_std_error() getting called, and I've already acknowledged that's
one that we need to fix (although as I said, fixing it may be tricky,
short of calling congestion_wait() and then retrying the allocation,
and hoping that in the meantime the OOM killer has freed up some
memory).

If you could give me a list of other memory allocations where
ext4_std_error() could get called, please let me know.  Note that in
the jbd2 layer, though, we handle a memory allocation failure by
retrying the allocation, to avoid the file system getting marked
read-only.  Examples of this include jbd2_journal_write_metadata_buffer(),
and jbd2_journal_add_journal_head() when it calls
journal_alloc_journal_head().  (Although the way we're doing the retry
in the latter case is a bit ugly and we're not sleeping with a call to
congestion_wait(), so it's something we should clean up.)

To give you an example of the intended use of ext4_std_error(), if the
journal commit code runs into a disk I/O error while writing to the
journal, the jbd2 code has to mark the journal as aborted.  This could
happen because the disk has gone off-line, or the HDD has run out of
spare disk sectors in its bad block replacement pool, so it has to
return a write error to the OS.  Once the journal has been marked as
aborted, the next time the ext4 code tries to access the journal, by
starting a new journal handle, or marking a metadata block dirty, the
jbd2 function will return an error, and this will cause
ext4_std_error() to be called so the file system can be marked as
requiring a file system check.
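
In other words, the intended pattern is roughly this (illustrative
only, not a specific call site):

	struct buffer_head *bh;

	bh = sb_bread(sb, block_nr);
	if (!bh) {
		/* a failed metadata read means the on-disk state is
		 * suspect: mark the fs as needing e2fsck and honour the
		 * errors= mount option */
		ext4_std_error(sb, -EIO);
		return -EIO;
	}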

Regards,

					- Ted

* Re: Memory allocation can cause ext4 filesystem to be remounted r/o
From: Nagachandra P @ 2013-06-28 13:52 UTC
  To: Theodore Ts'o; +Cc: Vikram MP, linux-ext4

Thanks a lot for explaining this.

I will have a look at the jbd2 code with a view to having a similar
implementation in ext4 as well. I will keep you posted on any patches
we try out so we can get your opinion.

Best regards
Naga

