From: Daniel Phillips <phillips@istop.com>
To: Rik van Riel <riel@redhat.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>,
	akpm@osdl.org, torvalds@osdl.org, hbryan@us.ibm.com,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	pavel@ucw.cz
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace
Date: Fri, 3 Dec 2004 17:07:26 -0500
Message-ID: <200412031707.27270.phillips@istop.com>
In-Reply-To: <Pine.LNX.4.61.0411271200580.12575@chimarrao.boston.redhat.com>

Hi Rik,

On Saturday 27 November 2004 12:07, Rik van Riel wrote:
> On Fri, 19 Nov 2004, Miklos Szeredi wrote:
> > The solution I'm thinking is along the lines of accounting the number
> > of writable pages assigned to FUSE filesystems.  Limiting this should
> > solve the deadlock problem.  This would only impact performance for
> > shared writable mappings, which are rare anyway.
>
> Note that NFS, and any filesystems on iSCSI or g/e/ndb block
> devices have the exact same problem.  To explain why this is
> the case, let's start with the VM allocation and pageout
> thresholds:
>
>    pages_min ------------------
>
>   GFP_ATOMIC ------------------
>
> PF_MEMALLOC ------------------
>
>     0 ------------------
>
> When writing out a dirty page, the pageout code is allowed
> to allocate network buffers down to the PF_MEMALLOC boundary.
>
> However, when receiving the ACK network packets from the server,
> the network stack is only allowed to allocate memory down to the
> GFP_ATOMIC watermark.
>
> This means it is relatively easy to get the system to deadlock,
> under a heavy shared mmap workload.

Why only mmap?  What makes generic_file_write immune to this problem?

> Limiting the number of 
> simultaneous writeouts might make the problem harder to trigger,
> but is still no solution since the network layer could exhaust
> its allowed memory for other packets, and never get around to
> processing the ACKs for the pageout related network traffic!
>
> I have a solution in mind, but it's not pretty. It might be safe
> now that DaveM no longer travels.  Still have to come up with a
> way to avoid being maimed by the other network developers, though...

The network stack is only a specific case of the problem.  In general, 
deadlock is possible whenever an asynchronous task sitting in the memory 
writeout path allocates memory.  That a userspace filesystem shows this 
deadlock is only indirectly due to being in userspace.  The real problem is 
that the userspace task is a separate task from the one trying to do 
writeout, and therefore does not inherit the PF_MEMALLOC state.
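
To make the inheritance point concrete, here is a minimal userspace sketch,
not kernel code.  Every name in it (toy_task, toy_zone, TOY_PF_MEMALLOC and
the two floor constants) is hypothetical and chosen only for illustration.
It models an allocator that lets a flagged task dig deeper into free memory
than an ordinary one, and a writeout that hands its work to a separate helper
task, so that only the helper's flags decide whether the allocation succeeds.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy userspace model; none of these names are real kernel APIs. */
#define TOY_PF_MEMALLOC 0x1u

enum {
    TOY_NORMAL_FLOOR   = 8, /* assumed floor for ordinary tasks */
    TOY_MEMALLOC_FLOOR = 0  /* assumed floor for TOY_PF_MEMALLOC tasks */
};

struct toy_task {
    unsigned int flags;
};

struct toy_zone {
    int free_pages;
};

/* Take one page if the caller's floor allows it. */
static bool toy_alloc_page(struct toy_zone *z, const struct toy_task *t)
{
    int floor = (t->flags & TOY_PF_MEMALLOC) ? TOY_MEMALLOC_FLOOR
                                             : TOY_NORMAL_FLOOR;

    assert(z->free_pages >= 0);
    if (z->free_pages <= floor)
        return false;
    z->free_pages--;
    return true;
}

/*
 * Writeout that delegates to a separate helper task.  The allocation runs
 * in the helper's context, so the writer's flags are never consulted.
 */
static bool toy_writeout_via_helper(struct toy_zone *z,
                                    const struct toy_task *writer,
                                    const struct toy_task *helper)
{
    (void)writer; /* the writer's memalloc state is not inherited */
    return toy_alloc_page(z, helper);
}
```

With free memory sitting at the ordinary floor, a flagged writer can still
allocate directly, but the same writer going through an unflagged helper
cannot, which is the situation a userspace filesystem daemon is in.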

A lot of people think that mempool solves this problem, but it does not.  
Consider the case of device mapper raid1:

  - Non PF_MEMALLOC tasks grab all available memory.

  - Normal device mapper writeout queues requests to its own raid1 daemon,
    which has a mempool for submitting multiple requests to underlying
    devices.

  - Non PF_MEMALLOC requests from the raid1 daemon tasks first try to do   
    normal allocation, but that fails, so they start eating into the mempool
    reserve.

  - When the mempool is empty, the next allocation forces the task into
    PF_MEMALLOC mode.

  - The PF_MEMALLOC task initiates writeout.

  - The writeout submits a request to device mapper raid1

  - Device mapper queues the request to its daemon

  - The daemon needs memory to set up writes to several disks

  - But normal memory is used up and the mempool is used up too.

  - Gack, we have a problem.
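
The sequence above can be walked through with a small userspace toy.  This is
only a sketch under stated assumptions: struct toy_mem, toy_daemon_alloc() and
toy_raid1_writeout() are hypothetical names, the counters stand in for pages
and pool objects, and a completed writeout is assumed to free exactly one page.
It is not the device mapper or mempool implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy accounting; the field and function names are hypothetical. */
struct toy_mem {
    int normal_free; /* pages ordinary allocations may still take */
    int pool_free;   /* objects left in the daemon's reserve pool */
};

/* Daemon-side allocation: try normal memory first, then the reserve. */
static bool toy_daemon_alloc(struct toy_mem *m)
{
    if (m->normal_free > 0) {
        m->normal_free--;
        return true;
    }
    if (m->pool_free > 0) {
        m->pool_free--;
        return true;
    }
    return false;
}

/*
 * One writeout through the toy raid1: the daemon needs one object per
 * mirror.  On success the objects go back where they came from and the
 * written page is freed.  On failure the request stalls, still holding
 * whatever it managed to get.
 */
static bool toy_raid1_writeout(struct toy_mem *m, int mirrors)
{
    struct toy_mem before = *m;
    int got = 0;

    while (got < mirrors && toy_daemon_alloc(m))
        got++;
    if (got < mirrors)
        return false;    /* no progress: nothing will ever be freed */

    *m = before;         /* objects released after the writes complete */
    m->normal_free++;    /* the written-out page becomes free */
    return true;
}
```

Starting from no normal memory and a drained reserve, the writeout that was
supposed to free memory returns false and frees nothing; with the reserve
intact, the same call completes and leaves one more free page.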

So how is mempool supposed to work anyway?  The idea is, each withdrawal from 
the pool is supposed to be balanced by freeing the object some bounded time 
later, so the system always makes progress.  The fly in the ointment is that 
the mempool can be exhausted by non PF_MEMALLOC tasks that later block on 
memory inversion, and so are unable to restore the pool.  So mempool as a 
general solution is inherently broken.  It works for, e.g., bio allocation, 
where there are no asynchronous memory users in the writeout path, but this is 
not the general case.  The hard case where we do have asynchronous users is 
getting more and more common.

So it seems we have a serious problem here that goes way beyond the specific 
examples people have noticed.  As I see it, the _only_ solution is to 
propagate the MEMALLOC state (and probably NOFS as well) to the asynchronous 
task in some way.  There are many possible variations on how to do this.  A 
lot of suggestions boil down to partitioning memory in various ways.  
However, each new partitioning scheme takes a big step backwards from the 
worthy goal of cache unification.  Moreover, even with partitioning, you 
still need a scheme for knowing when to dip into the partition or the 
partition's reserve if it has one.  In other words, we immediately return to 
the question of how to propagate the PF_MEMALLOC state.  This is the real 
problem, and we need to focus on it.
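
One of the many possible variations, sketched in userspace C purely for
illustration: the submitter records its memalloc state in the request, and
the asynchronous task adopts that state only while servicing it.  All names
here (sk_task, sk_request, sk_submit, sk_daemon_service, SK_PF_MEMALLOC and
the floor constants) are hypothetical, and this is not a proposal for a
specific kernel interface.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names throughout; this is a toy, not a kernel interface. */
#define SK_PF_MEMALLOC 0x1u

enum {
    SK_NORMAL_FLOOR   = 8, /* assumed floor for ordinary tasks */
    SK_MEMALLOC_FLOOR = 0  /* assumed floor for flagged tasks */
};

struct sk_task {
    unsigned int flags;
};

struct sk_request {
    unsigned int submitter_flags; /* memalloc state carried with the work */
};

static bool sk_alloc(int *free_pages, const struct sk_task *t)
{
    int floor = (t->flags & SK_PF_MEMALLOC) ? SK_MEMALLOC_FLOOR
                                            : SK_NORMAL_FLOOR;

    if (*free_pages <= floor)
        return false;
    (*free_pages)--;
    return true;
}

/* The submitter stamps its own memalloc state onto the request. */
static struct sk_request sk_submit(const struct sk_task *submitter)
{
    struct sk_request rq = {
        .submitter_flags = submitter->flags & SK_PF_MEMALLOC
    };
    return rq;
}

/* The daemon borrows the submitter's state for this one request only. */
static bool sk_daemon_service(int *free_pages, struct sk_task *daemon,
                              const struct sk_request *rq)
{
    unsigned int saved = daemon->flags;
    bool ok;

    daemon->flags |= rq->submitter_flags;
    ok = sk_alloc(free_pages, daemon);
    daemon->flags = saved; /* drop the borrowed state afterwards */
    return ok;
}
```

In this toy, a request from an ordinary task fails at the normal floor, while
a request from a flagged writeout task is allowed to dip below it, and the
daemon's own flags are unchanged once the request has been serviced.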

By the way, what is your secret solution for the network stack?

Regards,

Daniel
