linux-scsi.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Kirill A. Shutemov" <kirill@shutemov.name>
To: Damien Le Moal <Damien.LeMoal@wdc.com>
Cc: Mike Christie <mchristi@redhat.com>,
	"axboe@kernel.dk" <axboe@kernel.dk>,
	"James.Bottomley@HansenPartnership.com" 
	<James.Bottomley@HansenPartnership.com>,
	"martin.petersen@oracle.com" <martin.petersen@oracle.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-scsi@vger.kernel.org" <linux-scsi@vger.kernel.org>,
	"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>
Subject: Re: [RFC PATCH] Add proc interface to set PF_MEMALLOC flags
Date: Tue, 10 Sep 2019 15:41:16 +0300	[thread overview]
Message-ID: <20190910124116.74pxl73rybmkl5j3@box> (raw)
In-Reply-To: <DM6PR04MB5817096434DE9381DDB55184E7B60@DM6PR04MB5817.namprd04.prod.outlook.com>

On Tue, Sep 10, 2019 at 12:05:33PM +0000, Damien Le Moal wrote:
> On 2019/09/10 11:00, Kirill A. Shutemov wrote:
> > On Mon, Sep 09, 2019 at 11:28:04AM -0500, Mike Christie wrote:
> >> There are several storage drivers like dm-multipath, iscsi, and nbd that
> >> have userspace components that can run in the IO path. For example,
> >> iscsi and nbd's userspace deamons may need to recreate a socket and/or
> >> send IO on it, and dm-multipath's daemon multipathd may need to send IO
> >> to figure out the state of paths and re-set them up.
> >>
> >> In the kernel these drivers have access to GFP_NOIO/GFP_NOFS and the
> >> memalloc_*_save/restore functions to control the allocation behavior,
> >> but for userspace we would end up hitting a allocation that ended up
> >> writing data back to the same device we are trying to allocate for.
> >>
> >> This patch allows the userspace deamon to set the PF_MEMALLOC* flags
> >> through procfs. It currently only supports PF_MEMALLOC_NOIO, but
> >> depending on what other drivers and userspace file systems need, for
> >> the final version I can add the other flags for that file or do a file
> >> per flag or just do a memalloc_noio file.
> >>
> >> Signed-off-by: Mike Christie <mchristi@redhat.com>
> >> ---
> >>  Documentation/filesystems/proc.txt |  6 ++++
> >>  fs/proc/base.c                     | 53 ++++++++++++++++++++++++++++++
> >>  2 files changed, 59 insertions(+)
> >>
> >> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> >> index 99ca040e3f90..b5456a61a013 100644
> >> --- a/Documentation/filesystems/proc.txt
> >> +++ b/Documentation/filesystems/proc.txt
> >> @@ -46,6 +46,7 @@ Table of Contents
> >>    3.10  /proc/<pid>/timerslack_ns - Task timerslack value
> >>    3.11	/proc/<pid>/patch_state - Livepatch patch operation state
> >>    3.12	/proc/<pid>/arch_status - Task architecture specific information
> >> +  3.13  /proc/<pid>/memalloc - Control task's memory reclaim behavior
> >>  
> >>    4	Configuring procfs
> >>    4.1	Mount options
> >> @@ -1980,6 +1981,11 @@ Example
> >>   $ cat /proc/6753/arch_status
> >>   AVX512_elapsed_ms:      8
> >>  
> >> +3.13 /proc/<pid>/memalloc - Control task's memory reclaim behavior
> >> +-----------------------------------------------------------------------
> >> +A value of "noio" indicates that when a task allocates memory it will not
> >> +reclaim memory that requires starting phisical IO.
> >> +
> >>  Description
> >>  -----------
> >>  
> >> diff --git a/fs/proc/base.c b/fs/proc/base.c
> >> index ebea9501afb8..c4faa3464602 100644
> >> --- a/fs/proc/base.c
> >> +++ b/fs/proc/base.c
> >> @@ -1223,6 +1223,57 @@ static const struct file_operations proc_oom_score_adj_operations = {
> >>  	.llseek		= default_llseek,
> >>  };
> >>  
> >> +static ssize_t memalloc_read(struct file *file, char __user *buf, size_t count,
> >> +			     loff_t *ppos)
> >> +{
> >> +	struct task_struct *task;
> >> +	ssize_t rc = 0;
> >> +
> >> +	task = get_proc_task(file_inode(file));
> >> +	if (!task)
> >> +		return -ESRCH;
> >> +
> >> +	if (task->flags & PF_MEMALLOC_NOIO)
> >> +		rc = simple_read_from_buffer(buf, count, ppos, "noio", 4);
> >> +	put_task_struct(task);
> >> +	return rc;
> >> +}
> >> +
> >> +static ssize_t memalloc_write(struct file *file, const char __user *buf,
> >> +			      size_t count, loff_t *ppos)
> >> +{
> >> +	struct task_struct *task;
> >> +	char buffer[5];
> >> +	int rc = count;
> >> +
> >> +	memset(buffer, 0, sizeof(buffer));
> >> +	if (count != sizeof(buffer) - 1)
> >> +		return -EINVAL;
> >> +
> >> +	if (copy_from_user(buffer, buf, count))
> >> +		return -EFAULT;
> >> +	buffer[count] = '\0';
> >> +
> >> +	task = get_proc_task(file_inode(file));
> >> +	if (!task)
> >> +		return -ESRCH;
> >> +
> >> +	if (!strcmp(buffer, "noio")) {
> >> +		task->flags |= PF_MEMALLOC_NOIO;
> >> +	} else {
> >> +		rc = -EINVAL;
> >> +	}
> > 
> > Really? Without any privilege check? So any random user can tap into
> > __GFP_NOIO allocations?
> 
> OK. It probably should have a test on capable(CAP_SYS_ADMIN) or similar. Since
> these storage daemons are generally run as root anyway, that would still work
> for most setup I think.
> 
> > 
> > NAK.
> > 
> > I don't think that it's great idea in general to expose this low-level
> > machinery to userspace. But it's better to get comment from people move
> > familiar with reclaim path.
> 
> Any setup with stacked file systems and one of the IO path component being a
> user level process can benefit from this. See the problem described in this
> patch I pushed for (unsuccessfully as it was a heavy handed solution):
> https://www.spinics.net/lists/linux-fsdevel/msg148912.html
> 
> As the discussion in this thread shows, there is no existing simple solution to
> deal with this reclaim recursion problem. And automatic detection is too hard,
> if at all possible. With the proper access rights added, this user accessible
> interface does look very sensible to me.

Looking into the thread, have you find out if there's anything on FUSE
side that helps it to avoid deadlocks? Or FUSE just relies on luck with
this?

-- 
 Kirill A. Shutemov

  reply	other threads:[~2019-09-10 12:41 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-09-09 16:28 [RFC PATCH] Add proc interface to set PF_MEMALLOC flags Mike Christie
2019-09-09 18:26 ` Mike Christie
2019-09-10  8:35   ` Damien Le Moal
2019-09-11  8:43     ` Martin Raiber
     [not found]     ` <0102016d1f7af966-334f093b-2a62-4baa-9678-8d90d5fba6d9-000000@eu-west-1.amazonses.com>
2019-09-11 16:56       ` Mike Christie
2019-09-11 19:21         ` Martin Raiber
2019-09-12 16:22           ` Mike Christie
2019-09-12 16:27             ` Mike Christie
2019-09-10 22:12   ` Tetsuo Handa
2019-09-10 23:28     ` Kirill A. Shutemov
2019-09-11 15:23     ` Mike Christie
2019-09-10 10:00 ` Kirill A. Shutemov
2019-09-10 12:05   ` Damien Le Moal
2019-09-10 12:41     ` Kirill A. Shutemov [this message]
2019-09-10 13:37       ` Damien Le Moal
2019-09-10 16:06   ` Mike Christie
2019-09-11  8:23 ` Bart Van Assche
     [not found] <20190911031348.9648-1-hdanton@sina.com>
2019-09-11 10:07 ` Tetsuo Handa
2019-09-11 15:44   ` Mike Christie
     [not found] ` <20190911135237.11248-1-hdanton@sina.com>
2019-09-11 14:20   ` Tetsuo Handa

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190910124116.74pxl73rybmkl5j3@box \
    --to=kirill@shutemov.name \
    --cc=Damien.LeMoal@wdc.com \
    --cc=James.Bottomley@HansenPartnership.com \
    --cc=axboe@kernel.dk \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-scsi@vger.kernel.org \
    --cc=martin.petersen@oracle.com \
    --cc=mchristi@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).