* tmpfs fails fallocate(more than DRAM)
@ 2019-02-18 13:34 Adam Borowski
  2019-02-18 15:15 ` Matthew Wilcox
  2019-02-18 20:25 ` Adam Borowski
  0 siblings, 2 replies; 5+ messages in thread
From: Adam Borowski @ 2019-02-18 13:34 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel; +Cc: Marcin Ślusarz

Hi!
There's something that looks like a bug in tmpfs' implementation of
fallocate.  If you try to fallocate more than the available DRAM (yet
with plenty of swap space), it will evict everything swappable, then
fail, undoing all the work done so far.

The returned error is ENOMEM rather than the POSIX-mandated ENOSPC (that's
for posix_fallocate(), but our documentation doesn't mention ENOMEM for the
Linux-specific fallocate() either).

Doing the same allocation in multiple calls -- be it via non-overlapping
calls or even with same offset but increasing len -- works as expected.

An example:
Machine has 32GB RAM, minus 4GB memmapped as fake pmem.  No big tasks
(X, some shells, browser, ...).  Run 「while :;do free -m;done」 on another
terminal, then:

# mount -osize=64G -t tmpfs none /mnt/vol1
# chown you /mnt/vol1
$ cd /mnt/vol1
$ fallocate -l 32G foo
fallocate: fallocate failed: Cannot allocate memory
$ fallocate -l 28G foo
fallocate: fallocate failed: Cannot allocate memory
$ fallocate -l 27G foo
fallocate: fallocate failed: Cannot allocate memory
$ fallocate -l 26G foo
$ fallocate -l 52G foo

It takes a few seconds for the allocation to succeed, then a couple for it
to be torn down if it fails -- more if it has to write out the zeroes it
allocated in the previous call.
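
For completeness, here's a minimal userspace sketch of that multi-call
workaround -- everything in it (the hardcoded 52G target, the 1G chunk size,
the file name) is made up for illustration, and on 32-bit you'd also want
-D_FILE_OFFSET_BITS=64:

/* Sketch only: preallocate in 1G steps instead of one huge fallocate().
 * Chunks that were already allocated before a failure stay allocated. */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
	off_t want = (off_t)52 << 30;	/* total size, as in the example above */
	off_t step = (off_t)1 << 30;	/* 1G chunks */
	int fd = open("foo", O_RDWR | O_CREAT, 0644);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	for (off_t done = 0; done < want; done += step) {
		off_t len = want - done < step ? want - done : step;

		if (fallocate(fd, 0, done, len)) {
			fprintf(stderr, "fallocate at %lld: %s\n",
				(long long)done, strerror(errno));
			return 1;	/* earlier chunks are kept */
		}
	}
	close(fd);
	return 0;
}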

This raises multiple questions:
* why would fallocate bother to prefault the memory instead of just
  reserving it?  We want to kill overcommit, but reserving swap is as good
  -- if there's memory pressure, our big allocation will be evicted anyway.
* why does it insist on doing everything in one piece?  The biggest chunk I
  can see being beneficial is 1G (for hugepages).
* when it fails, why does it undo the work done so far?  This can matter
  for other reasons, such as EINTR -- and fallocate isn't expected to be
  atomic anyway.
* if I'm wrong and atomicity+prefaulting are desired, why does fallocate
  force just the delta (pages not yet allocated) to reside in core, rather
  than the entire requested range?

Thus, I believe fallocate on tmpfs should behave consistently with other
filesystems and succeed unless we run into ENOSPC.

Am I missing something?


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁
⢿⡄⠘⠷⠚⠋⠀ Have you accepted Khorne as your lord and saviour?
⠈⠳⣄⠀⠀⠀⠀



* Re: tmpfs fails fallocate(more than DRAM)
  2019-02-18 13:34 tmpfs fails fallocate(more than DRAM) Adam Borowski
@ 2019-02-18 15:15 ` Matthew Wilcox
  2019-02-18 20:25 ` Adam Borowski
  1 sibling, 0 replies; 5+ messages in thread
From: Matthew Wilcox @ 2019-02-18 15:15 UTC (permalink / raw)
  To: Adam Borowski; +Cc: linux-mm, linux-fsdevel, Marcin Ślusarz

On Mon, Feb 18, 2019 at 02:34:23PM +0100, Adam Borowski wrote:
> The returned error is ENOMEM rather than the POSIX-mandated ENOSPC (that's
> for posix_fallocate(), but our documentation doesn't mention ENOMEM for the
> Linux-specific fallocate() either).

Returning -ENOMEM rather than -ENOSPC in this situation is clearly
wrong, but just about every system call can return -ENOMEM.  It might
not even be due to memory allocation failure ... these days it's just
"I am unusually short on resources, you've done nothing wrong, but I
can't handle it right now".



* Re: tmpfs fails fallocate(more than DRAM)
  2019-02-18 13:34 tmpfs fails fallocate(more than DRAM) Adam Borowski
  2019-02-18 15:15 ` Matthew Wilcox
@ 2019-02-18 20:25 ` Adam Borowski
  2019-02-19  3:35   ` Hugh Dickins
  1 sibling, 1 reply; 5+ messages in thread
From: Adam Borowski @ 2019-02-18 20:25 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel, Hugh Dickins; +Cc: Marcin Ślusarz

Hi Hugh, it turns out this problem is caused by your commit
1aac1400319d30786f32b9290e9cc923937b3d57:

On Mon, Feb 18, 2019 at 02:34:23PM +0100, Adam Borowski wrote:
> There's something that looks like a bug in tmpfs' implementation of
> fallocate.  If you try to fallocate more than the available DRAM (yet
> with plenty of swap space), it will evict everything swappable, then
> fail, undoing all the work done so far.
> 
> The returned error is ENOMEM rather than the POSIX-mandated ENOSPC (that's
> for posix_fallocate(), but our documentation doesn't mention ENOMEM for the
> Linux-specific fallocate() either).
> 
> Doing the same allocation in multiple calls -- be it via non-overlapping
> calls or even with same offset but increasing len -- works as expected.

I don't quite understand your logic there -- it seems to be done on purpose?

#   tmpfs: quit when fallocate fills memory
#   
#   As it stands, a large fallocate() on tmpfs is liable to fill memory with
#   pages, freed on failure except when they run into swap, at which point
#   they become fixed into the file despite the failure.  That feels quite
#   wrong, to be consuming resources precisely when they're in short supply.

The page cache is just a cache, and thus running out of DRAM is in no way a
failure (as long as there's enough underlying storage).  Like any other
filesystem, once DRAM is full, tmpfs is supposed to start writeout.  A smart
filesystem can mark zero pages as SWAP_MAP_FALLOC to avoid physically
writing them out but doing so the naive hard way is at least correct.
    
#   Go the other way instead: shmem_fallocate() indicate the range it has
#   fallocated to shmem_writepage(), keeping count of pages it's allocating;
#   shmem_writepage() reactivate instead of swapping out pages fallocated by
#   this syscall (but happily swap out those from earlier occasions), keeping
#   count; shmem_fallocate() compare counts and give up once the reactivated
#   pages have started to coming back to writepage (approximately: some zones
#   would in fact recycle faster than others).

It's a weird inconsistency: why should space allocated in a previous call
act any differently from space we allocate right now?
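
(For anyone reading along, the mechanism described in that commit message
looks roughly like the sketch below -- simplified from my reading of the
v3.5 mm/shmem.c, so names and details may well be off:)

/* Simplified sketch, not the verbatim kernel code. */
struct shmem_falloc {
	pgoff_t start;		/* start of range currently being fallocated */
	pgoff_t next;		/* next page offset to be fallocated */
	pgoff_t nr_falloced;	/* how many new pages have been fallocated */
	pgoff_t nr_unswapped;	/* how often writepage refused to swap one out */
};

/* shmem_writepage(), roughly: pages belonging to the fallocate currently in
 * progress are reactivated rather than swapped out, and the refusal counted. */
	shmem_falloc = inode->i_private;
	if (shmem_falloc &&
	    index >= shmem_falloc->start && index < shmem_falloc->next) {
		shmem_falloc->nr_unswapped++;
		goto redirty;
	}

/* shmem_fallocate()'s allocation loop, roughly: give up once reclaim starts
 * bouncing this call's own pages back at it, i.e. memory is effectively full. */
	if (signal_pending(current))
		error = -EINTR;
	else if (shmem_falloc.nr_unswapped > shmem_falloc.nr_falloced)
		error = -ENOMEM;	/* the error complained about above */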
    
#   This is a little unusual, but works well: although we could consider the
#   failure to swap as a bug, and fix it later with SWAP_MAP_FALLOC handling
#   added in swapfile.c and memcontrol.c, I doubt that we shall ever want to.

It breaks use of tmpfs as a regular filesystem.  In particular, you don't
know that a program someone uses won't try to create a big file.  For
example, Debian buildds (where I first hit this problem) have setups such
as:
< jcristau> kilobyte: fwiw x86-csail-01.d.o has 75g /srv/buildd tmpfs, 8g ram, 89g swap

Using tmpfs this way is reasonable: traditional filesystems spend a lot of
effort to ensure crash consistency, and even if you disable journaling and
barriers, they will pointlessly write out the files.  Most builds can
succeed in far less than 8GB, not touching the disk even once.

[...]

> This raises multiple questions:
> * why would fallocate bother to prefault the memory instead of just
>   reserving it?  We want to kill overcommit, but reserving swap is as good
>   -- if there's memory pressure, our big allocation will be evicted anyway.

I see that this particular feature is not coded yet for swap.

> * why does it insist on doing everything in one piece?  The biggest chunk I
>   can see being beneficial is 1G (for hugepages).

At the moment, a big fallocate evicts all other swappable pages.  Doing it
piece by piece would at least allow swapping out memory it just allocated
(if we don't yet have a way to mark it up without physically writing
zeroes).

> * when it fails, why does it undo the work done so far?  This can matter
>   for other reasons, such as EINTR -- and fallocate isn't expected to be
>   atomic anyway.

I searched a bit for references suggesting that failed fallocates need to
be undone, and I can't find any.  Neither POSIX nor our man pages say a
word about the semantics of an interrupted fallocate, and neither glibc's
nor FreeBSD's fallback emulation rolls back.
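
(To illustrate: the usual fallback emulation amounts to touching one byte
per block, so a failure partway through simply leaves the earlier blocks
allocated.  A simplified sketch -- not the actual glibc code, which is more
careful to preserve existing data:)

#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Sketch of a posix_fallocate()-style fallback: write into every block of
 * the range.  On failure, blocks already written stay allocated -- nothing
 * is rolled back. */
static int emulated_fallocate(int fd, off_t offset, off_t len)
{
	struct stat st;

	if (fstat(fd, &st))
		return errno;
	for (off_t pos = offset; pos < offset + len; pos += st.st_blksize) {
		char zero = 0;

		if (pwrite(fd, &zero, 1, pos) != 1)
			return errno;
	}
	return 0;
}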


But as my understanding seems to go nearly the opposite way from your commit
message, am I getting it wrong?  It's you, not me, who's an mm regular...


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁
⢿⡄⠘⠷⠚⠋⠀ Have you accepted Khorne as your lord and saviour?
⠈⠳⣄⠀⠀⠀⠀



* Re: tmpfs fails fallocate(more than DRAM)
  2019-02-18 20:25 ` Adam Borowski
@ 2019-02-19  3:35   ` Hugh Dickins
  2019-02-19  4:16     ` Dave Chinner
  0 siblings, 1 reply; 5+ messages in thread
From: Hugh Dickins @ 2019-02-19  3:35 UTC (permalink / raw)
  To: Adam Borowski; +Cc: linux-mm, linux-fsdevel, Hugh Dickins, Marcin Slusarz


On Mon, 18 Feb 2019, Adam Borowski wrote:

> Hi Hugh, it turns out this problem is caused by your commit
> 1aac1400319d30786f32b9290e9cc923937b3d57:

Yes, part of the series which first enabled fallocate() on tmpfs.
You probably read most of them already, but if not, please do read
through those v3.5 commit comments on

e2d12e22c59c tmpfs: support fallocate preallocation
1635f6a74152 tmpfs: undo fallocation on failure
1aac1400319d tmpfs: quit when fallocate fills memory

where I said more about the awkward compromises made
than I would be able to bring back to mind today.

> 
> On Mon, Feb 18, 2019 at 02:34:23PM +0100, Adam Borowski wrote:
> > There's something that looks like a bug in tmpfs' implementation of
> > fallocate.  If you try to fallocate more than the available DRAM (yet
> > with plenty of swap space), it will evict everything swappable, then
> > fail, undoing all the work done so far.
> > 
> > The returned error is ENOMEM rather than the POSIX-mandated ENOSPC (that's
> > for posix_fallocate(), but our documentation doesn't mention ENOMEM for the
> > Linux-specific fallocate() either).

I can't speak for UNIX and its other relations, but it's well established
on Linux that the absence of a listed errno from the POSIX manpage or our
own manpage is no guarantee that that errno will not be returned by the
system call in question.  Those lists are really helpful for documenting
a variety of special meanings, but don't expect them to cover everything.

(Though I see that I was relieved to find EINTR given in the manpage.)

And as Matthew already said, ENOMEM is one that can very easily come back
from many system calls.  Though I disagree that it's wrong here: ENOSPC
is the errno you get when your fallocate() reaches the block limit (if
any) of the filesystem, ENOMEM is one you may hit earlier if it's unable
to complete the fallocate() successfully with the memory currently
available.

Fallocate is not the only place where tmpfs has to make that distinction:
ENOSPC for the filesystem constraint, ENOMEM for running out of memory
(itself ambiguous: physical memory available? swap included? memcg limit?
memory overcommit limitation?).

> > 
> > Doing the same allocation in multiple calls -- be it via non-overlapping
> > calls or even with same offset but increasing len -- works as expected.

Its indeterminacy is the worst thing about it, I think. I suppose that
procedure will often work, because of each attempt pushing more out to
swap.  But I certainly agree that it's all an unsatisfactory compromise.

As I remark in one of those commit messages, I very much wish that
fallocate(2) had been defined to return a positive count on success,
to allow for partial success like write(2); but too late to change by
the time I came along.

> 
> I don't quite understand your logic there -- it seems to be done on purpose?
> 
> #   tmpfs: quit when fallocate fills memory
> #   
> #   As it stands, a large fallocate() on tmpfs is liable to fill memory with
> #   pages, freed on failure except when they run into swap, at which point
> #   they become fixed into the file despite the failure.  That feels quite
> #   wrong, to be consuming resources precisely when they're in short supply.
> 
> The page cache is just a cache, and thus running out of DRAM is in no way a
> failure (as long as there's enough underlying storage).  Like any other
> filesystem, once DRAM is full, tmpfs is supposed to start writeout.  A smart
> filesystem can mark zero pages as SWAP_MAP_FALLOC to avoid physically
> writing them out but doing so the naive hard way is at least correct.

I suggest below that we have different perceptions of tmpfs:
I see it as a RAM-based filesystem, with swap overflow; you see it
as a swap-based filesystem, caching in RAM.  I think that if it were
the latter, we'd have spent a lot more time designing its swap layout.

>     
> #   Go the other way instead: shmem_fallocate() indicate the range it has
> #   fallocated to shmem_writepage(), keeping count of pages it's allocating;
> #   shmem_writepage() reactivate instead of swapping out pages fallocated by
> #   this syscall (but happily swap out those from earlier occasions), keeping
> #   count; shmem_fallocate() compare counts and give up once the reactivated
> #   pages have started to coming back to writepage (approximately: some zones
> #   would in fact recycle faster than others).
> 
> It's a weird inconsistency: why should space allocated in a previous call
> act any differently from space we allocate right now?

"weird" I'll agree with (and you're not the first person to use the word
"weird" of tmpfs in the last week!) but "inconsistency", in that context,
no.  Space allocated in a previous call has been guaranteed to the caller,
and that guarantee is likely to be what they wanted fallocate() for in
the first place.  Space allocated right now, before we return success or
failure from the system call, is still revocable.

>     
> #   This is a little unusual, but works well: although we could consider the
> #   failure to swap as a bug, and fix it later with SWAP_MAP_FALLOC handling
> #   added in swapfile.c and memcontrol.c, I doubt that we shall ever want to.
> 
> It breaks use of tmpfs as a regular filesystem.  In particular, you don't
> know that a program someone uses won't try to create a big file.  For
> example, Debian buildds (where I first hit this problem) have setups such
> as:
> < jcristau> kilobyte: fwiw x86-csail-01.d.o has 75g /srv/buildd tmpfs, 8g ram, 89g swap
> 
> Using tmpfs this way is reasonable: traditional filesystems spend a lot of
> effort to ensure crash consistency, and even if you disable journaling and
> barriers, they will pointlessly write out the files.  Most builds can
> succeed in far less than 8GB, not touching the disk even once.

Yes, unsatisfactory: I tried for the best compromise I could imagine.
fallocate() on tmpfs remains useful in most circumstances, but with
this peculiar failure mode once going beyond RAM and well into swap.

With that 8G/89G split, I think you perceive tmpfs as a swap-based
filesystem, whereas I perceive it as a RAM-based filesystem which uses
swap for overflow; so made compromises appropriate to that view.

> 
> [...]
> 
> > This raises multiple questions:
> > * why would fallocate bother to prefault the memory instead of just
> >   reserving it?  We want to kill overcommit, but reserving swap is as good
> >   -- if there's memory pressure, our big allocation will be evicted anyway.

The only way I know of to reserve memory, respecting all the different
limiting mechanisms imposed (memcg limits, filesystem limits, zone
watermarks, ...), is to allocate it (not sure what you mean by prefault).
hugetlbfs does have a reservation system, and its very own pool of memory,
but that's not tmpfs.

> 
> I see that this particular feature is not coded yet for swap.

I expect you're right, but I don't see what you're referring to there:
ah, probably the SWAP_MAP_FALLOC mentioned above, from a comment in
shmem_writepage().  Yes, not implemented: it would handle a rare case
more efficiently, but I don't think it would change the fundamentals
at all.  Or maybe it's too long since I thought through this area,
and it really would make a real difference - dunno.

> 
> > * why does it insist on doing everything in one piece?  The biggest chunk I
> >   can see being beneficial is 1G (for hugepages).

It insists on attempting to do what you ask: if you ask for one big piece,
that's what it tries for.

> 
> At the moment, a big fallocate evicts all other swappable pages.  Doing it
> piece by piece would at least allow swapping out memory it just allocated
> (if we don't yet have a way to mark it up without physically writing
> zeroes).
> 
> > * when it fails, why does it undo the work done so far?  This can matter
> >   for other reasons, such as EINTR -- and fallocate isn't expected to be
> >   atomic anyway.
> 
> I searched a bit for references suggesting that failed fallocates need to
> be undone, and I can't find any.  Neither POSIX nor our man pages say a
> word about the semantics of an interrupted fallocate, and neither glibc's
> nor FreeBSD's fallback emulation rolls back.

To me it was self-evident: with a few awkward exceptions (awkward because
they would have a difficult job to undo, and awkward because they argue
against me!), a system call either succeeds or fails, or reports partial
success.  If fallocate() says it failed (and is not allowed to report
partial success), then it should not have allocated.  Especially in the
case of RAM, when filling it up makes it rather hard to unfill (another
persistent problem with tmpfs is the way it can occupy all of memory,
and the OOM killer go about killing a thousand processes, but none of
them help because the memory is occupied by a tmpfs, not by a process).

Now that you question it (did I not do so at the time? I thought I did),
I try fallocate() on btrfs and ext4 and xfs.  btrfs and xfs behave as I
expect above, failing outright with ENOSPC if it will not fit; whereas
ext4 proceeds to fill up the filesystem, leaving it full when it says
that it failed.  Looks like I had a choice of models to follow: the
ext4 model would have been easier to follow, but risked OOM.

> 
> But as my understanding seems to go nearly the opposite way from your commit
> message, am I getting it wrong?  It's you, not me, who's an mm regular...
> 
> 
> Meow!
> -- 
> ⢀⣴⠾⠻⢶⣦⠀
> ⣾⠁⢠⠒⠀⣿⡁
> ⢿⡄⠘⠷⠚⠋⠀ Have you accepted Khorne as your lord and saviour?

Actually, no.  Would s/he have a useful insight to share on fallocate()?

Hugh


* Re: tmpfs fails fallocate(more than DRAM)
  2019-02-19  3:35   ` Hugh Dickins
@ 2019-02-19  4:16     ` Dave Chinner
  0 siblings, 0 replies; 5+ messages in thread
From: Dave Chinner @ 2019-02-19  4:16 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Adam Borowski, linux-mm, linux-fsdevel, Marcin Slusarz

On Mon, Feb 18, 2019 at 07:35:01PM -0800, Hugh Dickins wrote:
> On Mon, 18 Feb 2019, Adam Borowski wrote:
> > I searched a bit for references suggesting that failed fallocates need to
> > be undone, and I can't find any.  Neither POSIX nor our man pages say a
> > word about the semantics of an interrupted fallocate, and neither glibc's
> > nor FreeBSD's fallback emulation rolls back.
> 
> To me it was self-evident: with a few awkward exceptions (awkward because
> they would have a difficult job to undo, and awkward because they argue
> against me!), a system call either succeeds or fails, or reports partial
> success.  If fallocate() says it failed (and is not allowed to report
> partial success), then it should not have allocated.  Especially in the
> case of RAM, when filling it up makes it rather hard to unfill (another
> persistent problem with tmpfs is the way it can occupy all of memory,
> and the OOM killer go about killing a thousand processes, but none of
> them help because the memory is occupied by a tmpfs, not by a process).
> 
> Now that you question it (did I not do so at the time? I thought I did),
> I try fallocate() on btrfs and ext4 and xfs.  btrfs and xfs behave as I
> expect above, failing outright with ENOSPC if it will not fit;

If only it were that simple. :/

XFS can do partial allocation and fail - it all depends on how many
extent allocations are required before ENOSPC is actually hit. e.g.
if you ask for 10GB and there is only 5GB free, it should fail
straight away. However, if there's 20GB free in 1GB chunks, it will
loop allocating 1GB extents. If something else is allocating at the
same time, the fallocate could get to, say, 8GB allocated and then
hit ENOSPC.

In which case, we'll return the ENOSPC error, but we'll also leave
the 8GB of space already allocated to the file there. i.e. it
doesn't clean up after itself.

The reason for this is that we don't know, after we've performed the
allocations, which regions of the preallocated range were actually
allocated by the preallocation.  I.e. fallocate can be run over a
range that already contains some extents - it simply skips over
regions that are already allocated.  Hence we don't know what we are
supposed to clean up, and so we leave the corpse lying around for
someone else to deal with (e.g. by sparsifying the file again).
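
Roughly speaking, the preallocation is a loop like this (illustrative
pseudocode, not the actual XFS code):

	/* Each pass may allocate a new extent or skip a range that already
	 * had one; nothing records which, so there's nothing precise to
	 * undo when a later pass hits ENOSPC. */
	while (remaining > 0) {
		len = min(remaining, max_extent_len);	/* e.g. up to 8GB on XFS */
		error = allocate_extents(ip, offset, len);
		if (error)
			return error;	/* extents from earlier passes remain */
		offset += len;
		remaining -= len;
	}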

> whereas
> ext4 proceeds to fill up the filesystem, leaving it full when it says
> that it failed.

This is much the same behaviour as XFS - you see it more easily with
ext4 because it has much smaller maximum extent size (128MB) than
XFS (8GB) and so needs to iterate multiple allocations sooner than
XFS or btrfs need to.

I'm not sure what btrfs does.

> Looks like I had a choice of models to follow: the
> ext4 model would have been easier to follow, but risked OOM.

fallocate() gives you the rope to choose what is best for the
filesystem - it doesn't specify behaviour on failure precisely
because it can be very difficult (not to mention complex!) for
filesystems to unwind partial failures....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


