Linux-Block Archive on lore.kernel.org
 help / color / Atom feed
* Re: [RFC PATCH] Add proc interface to set PF_MEMALLOC flags
       [not found] <20190911031348.9648-1-hdanton@sina.com>
@ 2019-09-11 10:07 ` Tetsuo Handa
  2019-09-11 15:44   ` Mike Christie
       [not found] ` <20190911135237.11248-1-hdanton@sina.com>
  1 sibling, 1 reply; 20+ messages in thread
From: Tetsuo Handa @ 2019-09-11 10:07 UTC (permalink / raw)
  To: Hillf Danton, Mike Christie
  Cc: Kirill A. Shutemov, axboe, James.Bottomley, martin.petersen,
	linux-kernel, linux-scsi, linux-block, Linux-MM

On 2019/09/11 12:13, Hillf Danton wrote:
> 
> On Tue, 10 Sep 2019 11:06:03 -0500 From: Mike Christie <mchristi@redhat.com>
>>
>>> Really? Without any privilege check? So any random user can tap into
>>> __GFP_NOIO allocations?
>>
>> That was a mistake on my part. I will add it in.
>>
> You may alternatively madvise a nutcracker as long as you would have
> added a sledgehammer under /proc instead of a gavel.
> 
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -45,6 +45,7 @@
>  #define MADV_SEQUENTIAL	2		/* expect sequential page references */
>  #define MADV_WILLNEED	3		/* will need these pages */
>  #define MADV_DONTNEED	4		/* don't need these pages */
> +#define MADV_NOIO	5		/* set PF_MEMALLOC_NOIO */
>  
>  /* common parameters: try to keep these consistent across architectures */
>  #define MADV_FREE	8		/* free pages only if memory pressure */
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -716,6 +716,7 @@ madvise_behavior_valid(int behavior)
>  	case MADV_WILLNEED:
>  	case MADV_DONTNEED:
>  	case MADV_FREE:
> +	case MADV_NOIO:
>  #ifdef CONFIG_KSM
>  	case MADV_MERGEABLE:
>  	case MADV_UNMERGEABLE:
> @@ -813,6 +814,11 @@ SYSCALL_DEFINE3(madvise, unsigned long,
>  	if (!madvise_behavior_valid(behavior))
>  		return error;
>  
> +	if (behavior == MADV_NOIO) {
> +		current->flags |= PF_MEMALLOC_NOIO;

Yes, for "modifying p->flags when p != current" is not permitted.

But I guess that there is a problem. Setting PF_MEMALLOC_NOIO causes
current_gfp_context() to mask __GFP_IO | __GFP_FS, but the OOM killer cannot
be invoked when __GFP_FS is masked. As a result, any userspace thread which
has PF_MEMALLOC_NOIO cannot invoke the OOM killer. If the userspace thread
which uses PF_MEMALLOC_NOIO is involved in memory reclaiming activities,
the memory reclaiming activities won't be able to make forward progress when
the userspace thread triggered e.g. a page fault. Can the "userspace components
that can run in the IO path" survive without any memory allocation?

> +		return 0;
> +	}
> +
>  	if (start & ~PAGE_MASK)
>  		return error;
>  	len = (len_in + ~PAGE_MASK) & PAGE_MASK;


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH] Add proc interface to set PF_MEMALLOC flags
       [not found] ` <20190911135237.11248-1-hdanton@sina.com>
@ 2019-09-11 14:20   ` Tetsuo Handa
  0 siblings, 0 replies; 20+ messages in thread
From: Tetsuo Handa @ 2019-09-11 14:20 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Mike Christie, Kirill A. Shutemov, axboe, James.Bottomley,
	martin.petersen, linux-kernel, linux-scsi, linux-block, Linux-MM

On 2019/09/11 22:52, Hillf Danton wrote:
> 
> On Wed, 11 Sep 2019 19:07:34 +0900
>>
>> But I guess that there is a problem.
> 
> Not a new one. (see commit 7dea19f9ee63)
> 
>> Setting PF_MEMALLOC_NOIO causes
>> current_gfp_context() to mask __GFP_IO | __GFP_FS, but the OOM killer cannot
>> be invoked when __GFP_FS is masked. As a result, any userspace thread which
>> has PF_MEMALLOC_NOIO cannot invoke the OOM killer.
> 
> Correct.
> 
>> If the userspace thread
>> which uses PF_MEMALLOC_NOIO is involved in memory reclaiming activities,
>> the memory reclaiming activities won't be able to make forward progress when
>> the userspace thread triggered e.g. a page fault. Can the "userspace components
>> that can run in the IO path" survive without any memory allocation?
> 
> Good question.
> 
> It can be solved without oom killer involved because user should be
> aware of the risk of PF_MEMALLOC_NOIO if they ask for the convenience.
> OTOH we are able to control any abuse of it as you worry, knowing that
> the combination of __GFP_FS and oom killer can not get more than 50 users
> works done, and we have to pay as much attention as we can to the decisions
> they make. In case of PF_MEMALLOC_NOIO, we simply fail the allocation
> rather than killing a random victim.

According to commit c288983dddf71421 ("mm/page_alloc.c: make sure OOM victim can
try allocations with no watermarks once"), memory allocation failure from a page
fault results in invocation of the OOM killer via pagefault_out_of_memory() which
after all kills a random victim.

> 
> 
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3854,6 +3854,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, un
>  	 * out_of_memory). Once filesystems are ready to handle allocation
>  	 * failures more gracefully we should just bail out here.
>  	 */
> +	if (current->flags & PF_MEMALLOC_NOIO)
> +		goto out;
>  
>  	/* The OOM killer may not free memory on a specific node */
>  	if (gfp_mask & __GFP_THISNODE)
> 
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH] Add proc interface to set PF_MEMALLOC flags
  2019-09-11 10:07 ` [RFC PATCH] Add proc interface to set PF_MEMALLOC flags Tetsuo Handa
@ 2019-09-11 15:44   ` Mike Christie
  0 siblings, 0 replies; 20+ messages in thread
From: Mike Christie @ 2019-09-11 15:44 UTC (permalink / raw)
  To: Tetsuo Handa, Hillf Danton
  Cc: Kirill A. Shutemov, axboe, James.Bottomley, martin.petersen,
	linux-kernel, linux-scsi, linux-block, Linux-MM

On 09/11/2019 05:07 AM, Tetsuo Handa wrote:
> On 2019/09/11 12:13, Hillf Danton wrote:
>>
>> On Tue, 10 Sep 2019 11:06:03 -0500 From: Mike Christie <mchristi@redhat.com>
>>>
>>>> Really? Without any privilege check? So any random user can tap into
>>>> __GFP_NOIO allocations?
>>>
>>> That was a mistake on my part. I will add it in.
>>>
>> You may alternatively madvise a nutcracker as long as you would have
>> added a sledgehammer under /proc instead of a gavel.
>>
>> --- a/include/uapi/asm-generic/mman-common.h
>> +++ b/include/uapi/asm-generic/mman-common.h
>> @@ -45,6 +45,7 @@
>>  #define MADV_SEQUENTIAL	2		/* expect sequential page references */
>>  #define MADV_WILLNEED	3		/* will need these pages */
>>  #define MADV_DONTNEED	4		/* don't need these pages */
>> +#define MADV_NOIO	5		/* set PF_MEMALLOC_NOIO */
>>  
>>  /* common parameters: try to keep these consistent across architectures */
>>  #define MADV_FREE	8		/* free pages only if memory pressure */
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -716,6 +716,7 @@ madvise_behavior_valid(int behavior)
>>  	case MADV_WILLNEED:
>>  	case MADV_DONTNEED:
>>  	case MADV_FREE:
>> +	case MADV_NOIO:
>>  #ifdef CONFIG_KSM
>>  	case MADV_MERGEABLE:
>>  	case MADV_UNMERGEABLE:
>> @@ -813,6 +814,11 @@ SYSCALL_DEFINE3(madvise, unsigned long,
>>  	if (!madvise_behavior_valid(behavior))
>>  		return error;
>>  
>> +	if (behavior == MADV_NOIO) {
>> +		current->flags |= PF_MEMALLOC_NOIO;
> 
> Yes, for "modifying p->flags when p != current" is not permitted.
> 
> But I guess that there is a problem. Setting PF_MEMALLOC_NOIO causes
> current_gfp_context() to mask __GFP_IO | __GFP_FS, but the OOM killer cannot
> be invoked when __GFP_FS is masked. As a result, any userspace thread which
> has PF_MEMALLOC_NOIO cannot invoke the OOM killer. If the userspace thread
> which uses PF_MEMALLOC_NOIO is involved in memory reclaiming activities,
> the memory reclaiming activities won't be able to make forward progress when
> the userspace thread triggered e.g. a page fault. Can the "userspace components
> that can run in the IO path" survive without any memory allocation?
> 

Yes and no, when they can they will have preallocated the resources they
need to make forward progress similar to how kernel storage drivers do.
However for some resources, like in the network layer, both userspace
and kernel drivers are not able to preallocate and may fail.


>> +		return 0;
>> +	}
>> +
>>  	if (start & ~PAGE_MASK)
>>  		return error;
>>  	len = (len_in + ~PAGE_MASK) & PAGE_MASK;
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH] Add proc interface to set PF_MEMALLOC flags
  2019-09-12 16:22           ` Mike Christie
@ 2019-09-12 16:27             ` Mike Christie
  0 siblings, 0 replies; 20+ messages in thread
From: Mike Christie @ 2019-09-12 16:27 UTC (permalink / raw)
  To: Martin Raiber, linux-kernel, linux-scsi, linux-block, Linux-MM

On 09/12/2019 11:22 AM, Mike Christie wrote:
> On 09/11/2019 02:21 PM, Martin Raiber wrote:
>> On 11.09.2019 18:56 Mike Christie wrote:
>>> On 09/11/2019 03:40 AM, Martin Raiber wrote:
>>>> On 10.09.2019 10:35 Damien Le Moal wrote:
>>>>> Mike,
>>>>>
>>>>> On 2019/09/09 19:26, Mike Christie wrote:
>>>>>> Forgot to cc linux-mm.
>>>>>>
>>>>>> On 09/09/2019 11:28 AM, Mike Christie wrote:
>>>>>>> There are several storage drivers like dm-multipath, iscsi, and nbd that
>>>>>>> have userspace components that can run in the IO path. For example,
>>>>>>> iscsi and nbd's userspace deamons may need to recreate a socket and/or
>>>>>>> send IO on it, and dm-multipath's daemon multipathd may need to send IO
>>>>>>> to figure out the state of paths and re-set them up.
>>>>>>>
>>>>>>> In the kernel these drivers have access to GFP_NOIO/GFP_NOFS and the
>>>>>>> memalloc_*_save/restore functions to control the allocation behavior,
>>>>>>> but for userspace we would end up hitting a allocation that ended up
>>>>>>> writing data back to the same device we are trying to allocate for.
>>>>>>>
>>>>>>> This patch allows the userspace deamon to set the PF_MEMALLOC* flags
>>>>>>> through procfs. It currently only supports PF_MEMALLOC_NOIO, but
>>>>>>> depending on what other drivers and userspace file systems need, for
>>>>>>> the final version I can add the other flags for that file or do a file
>>>>>>> per flag or just do a memalloc_noio file.
>>>>> Awesome. That probably will be the perfect solution for the problem we hit with
>>>>> tcmu-runner a while back (please see this thread:
>>>>> https://www.spinics.net/lists/linux-fsdevel/msg148912.html).
>>>>>
>>>>> I think we definitely need nofs as well for dealing with cases where the backend
>>>>> storage for the user daemon is a file.
>>>>>
>>>>> I will give this patch a try as soon as possible (I am traveling currently).
>>>>>
>>>>> Best regards.
>>>> I had issues with this as well, and work on this is appreciated! In my
>>>> case it is a loop block device on a fuse file system.
>>>> Setting PF_LESS_THROTTLE was the one that helped the most, though, so
>>>> add an option for that as well? I set this via prctl() for the thread
>>>> calling it (was easiest to add to).
>>>>
>>>> Sorry, I have no idea about the current rationale, but wouldn't it be
>>>> better to have a way to mask a set of block devices/file systems not to
>>>> write-back to in a thread. So in my case I'd specify that the fuse
>>>> daemon threads cannot write-back to the file system and loop device
>>>> running on top of the fuse file system, while all other block
>>>> devices/file systems can be write-back to (causing less swapping/OOM
>>>> issues).
>>> I'm not sure I understood you.
>>>
>>> The storage daemons I mentioned normally kick off N threads per M
>>> devices. The threads handle duties like IO and error handling for those
>>> devices. Those threads would set the flag, so those IO/error-handler
>>> related operations do not end up writing back to them. So it works
>>> similar to how storage drivers work in the kernel where iscsi_tcp has an
>>> xmit thread and that does memalloc_noreclaim_save. Only the threads for
>>> those specific devices being would set the flag.
>>>
>>> In your case, it sounds like you have a thread/threads that would
>>> operate on multiple devices and some need the behavior and some do not.
>>> Is that right?
>>
>> No, sounds the same as your case. As an example think of vdfuse (or
>> qemu-nbd locally). You'd have something like
>>
>> ext4(a) <- loop <- fuse file system <- vdfuse <- disk.vdi container file
>> <- ext4(b) <- block device
>>
>> If vdfuse threads cause writeback to ext4(a), you'd get the issue we
>> have. Setting PF_LESS_THROTTLE and/or PF_MEMALLOC_NOIO mostly avoids
>> this problem, but with only PF_LESS_THROTTLE there are still corner
>> cases (I think if ext4(b) slows down suddenly) where it wedges itself
>> and the side effect of setting PF_MEMALLOC_NOIO are being discussed...
>> The best solution would be, I guess, to have a way for vdfuse to set
>> something, such that write-back to ext4(a) isn't allowed from those
>> threads, but write-back to ext4(b) (and all other block devices) is. But
>> I only have a rough idea of how write-back works, so this is really only
>> a guess.
> 
> I see now.
> 
> Initially, would it be ok to keep it simple and keep the existing kernel
> behavior? For your example, is the PF_MEMALLOC_NOIO use in loop today

Or do it in two stages.

1. For devices like mine, we just use the existing behavior where it
gets set for the thread and is for all devices. We know from iscsi/nbd
it is already ok from their kernel use. I do not need to add any extra
locking/complexity to the block, vm, fs code.

2. We can then add the ability to pass in a mount or upper layer block
device for setups like yours where we already know the topology and it
isn't going to change.


> causing a lot of swap/oom issues? For iscsi_tcp and nbd their memalloc
> and GFP_NOIO use is not.
> 
> The problem for the storage driver daemons I mentioned in the patch is
> that they are at the bottom of the stack and they do not know what is
> going to be added above them plus it can change, so we will have to walk
> the storage device stack while IO is running and allocations are trying
> to execute. It looks like I will end up having to insert extra
> locking/refcounts into multiple layers, and I am not sure if the extra
> complexity is going to be worth it if we are not seeing problems from
> existing kernel users.
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH] Add proc interface to set PF_MEMALLOC flags
  2019-09-11 19:21         ` Martin Raiber
@ 2019-09-12 16:22           ` Mike Christie
  2019-09-12 16:27             ` Mike Christie
  0 siblings, 1 reply; 20+ messages in thread
From: Mike Christie @ 2019-09-12 16:22 UTC (permalink / raw)
  To: Martin Raiber, linux-kernel, linux-scsi, linux-block, Linux-MM

On 09/11/2019 02:21 PM, Martin Raiber wrote:
> On 11.09.2019 18:56 Mike Christie wrote:
>> On 09/11/2019 03:40 AM, Martin Raiber wrote:
>>> On 10.09.2019 10:35 Damien Le Moal wrote:
>>>> Mike,
>>>>
>>>> On 2019/09/09 19:26, Mike Christie wrote:
>>>>> Forgot to cc linux-mm.
>>>>>
>>>>> On 09/09/2019 11:28 AM, Mike Christie wrote:
>>>>>> There are several storage drivers like dm-multipath, iscsi, and nbd that
>>>>>> have userspace components that can run in the IO path. For example,
>>>>>> iscsi and nbd's userspace deamons may need to recreate a socket and/or
>>>>>> send IO on it, and dm-multipath's daemon multipathd may need to send IO
>>>>>> to figure out the state of paths and re-set them up.
>>>>>>
>>>>>> In the kernel these drivers have access to GFP_NOIO/GFP_NOFS and the
>>>>>> memalloc_*_save/restore functions to control the allocation behavior,
>>>>>> but for userspace we would end up hitting a allocation that ended up
>>>>>> writing data back to the same device we are trying to allocate for.
>>>>>>
>>>>>> This patch allows the userspace deamon to set the PF_MEMALLOC* flags
>>>>>> through procfs. It currently only supports PF_MEMALLOC_NOIO, but
>>>>>> depending on what other drivers and userspace file systems need, for
>>>>>> the final version I can add the other flags for that file or do a file
>>>>>> per flag or just do a memalloc_noio file.
>>>> Awesome. That probably will be the perfect solution for the problem we hit with
>>>> tcmu-runner a while back (please see this thread:
>>>> https://www.spinics.net/lists/linux-fsdevel/msg148912.html).
>>>>
>>>> I think we definitely need nofs as well for dealing with cases where the backend
>>>> storage for the user daemon is a file.
>>>>
>>>> I will give this patch a try as soon as possible (I am traveling currently).
>>>>
>>>> Best regards.
>>> I had issues with this as well, and work on this is appreciated! In my
>>> case it is a loop block device on a fuse file system.
>>> Setting PF_LESS_THROTTLE was the one that helped the most, though, so
>>> add an option for that as well? I set this via prctl() for the thread
>>> calling it (was easiest to add to).
>>>
>>> Sorry, I have no idea about the current rationale, but wouldn't it be
>>> better to have a way to mask a set of block devices/file systems not to
>>> write-back to in a thread. So in my case I'd specify that the fuse
>>> daemon threads cannot write-back to the file system and loop device
>>> running on top of the fuse file system, while all other block
>>> devices/file systems can be write-back to (causing less swapping/OOM
>>> issues).
>> I'm not sure I understood you.
>>
>> The storage daemons I mentioned normally kick off N threads per M
>> devices. The threads handle duties like IO and error handling for those
>> devices. Those threads would set the flag, so those IO/error-handler
>> related operations do not end up writing back to them. So it works
>> similar to how storage drivers work in the kernel where iscsi_tcp has an
>> xmit thread and that does memalloc_noreclaim_save. Only the threads for
>> those specific devices being would set the flag.
>>
>> In your case, it sounds like you have a thread/threads that would
>> operate on multiple devices and some need the behavior and some do not.
>> Is that right?
> 
> No, sounds the same as your case. As an example think of vdfuse (or
> qemu-nbd locally). You'd have something like
> 
> ext4(a) <- loop <- fuse file system <- vdfuse <- disk.vdi container file
> <- ext4(b) <- block device
> 
> If vdfuse threads cause writeback to ext4(a), you'd get the issue we
> have. Setting PF_LESS_THROTTLE and/or PF_MEMALLOC_NOIO mostly avoids
> this problem, but with only PF_LESS_THROTTLE there are still corner
> cases (I think if ext4(b) slows down suddenly) where it wedges itself
> and the side effect of setting PF_MEMALLOC_NOIO are being discussed...
> The best solution would be, I guess, to have a way for vdfuse to set
> something, such that write-back to ext4(a) isn't allowed from those
> threads, but write-back to ext4(b) (and all other block devices) is. But
> I only have a rough idea of how write-back works, so this is really only
> a guess.

I see now.

Initially, would it be ok to keep it simple and keep the existing kernel
behavior? For your example, is the PF_MEMALLOC_NOIO use in loop today
causing a lot of swap/oom issues? For iscsi_tcp and nbd their memalloc
and GFP_NOIO use is not.

The problem for the storage driver daemons I mentioned in the patch is
that they are at the bottom of the stack and they do not know what is
going to be added above them plus it can change, so we will have to walk
the storage device stack while IO is running and allocations are trying
to execute. It looks like I will end up having to insert extra
locking/refcounts into multiple layers, and I am not sure if the extra
complexity is going to be worth it if we are not seeing problems from
existing kernel users.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH] Add proc interface to set PF_MEMALLOC flags
  2019-09-11 16:56       ` Mike Christie
@ 2019-09-11 19:21         ` Martin Raiber
  2019-09-12 16:22           ` Mike Christie
  0 siblings, 1 reply; 20+ messages in thread
From: Martin Raiber @ 2019-09-11 19:21 UTC (permalink / raw)
  To: Mike Christie, linux-kernel, linux-scsi, linux-block, Linux-MM

On 11.09.2019 18:56 Mike Christie wrote:
> On 09/11/2019 03:40 AM, Martin Raiber wrote:
>> On 10.09.2019 10:35 Damien Le Moal wrote:
>>> Mike,
>>>
>>> On 2019/09/09 19:26, Mike Christie wrote:
>>>> Forgot to cc linux-mm.
>>>>
>>>> On 09/09/2019 11:28 AM, Mike Christie wrote:
>>>>> There are several storage drivers like dm-multipath, iscsi, and nbd that
>>>>> have userspace components that can run in the IO path. For example,
>>>>> iscsi and nbd's userspace deamons may need to recreate a socket and/or
>>>>> send IO on it, and dm-multipath's daemon multipathd may need to send IO
>>>>> to figure out the state of paths and re-set them up.
>>>>>
>>>>> In the kernel these drivers have access to GFP_NOIO/GFP_NOFS and the
>>>>> memalloc_*_save/restore functions to control the allocation behavior,
>>>>> but for userspace we would end up hitting a allocation that ended up
>>>>> writing data back to the same device we are trying to allocate for.
>>>>>
>>>>> This patch allows the userspace deamon to set the PF_MEMALLOC* flags
>>>>> through procfs. It currently only supports PF_MEMALLOC_NOIO, but
>>>>> depending on what other drivers and userspace file systems need, for
>>>>> the final version I can add the other flags for that file or do a file
>>>>> per flag or just do a memalloc_noio file.
>>> Awesome. That probably will be the perfect solution for the problem we hit with
>>> tcmu-runner a while back (please see this thread:
>>> https://www.spinics.net/lists/linux-fsdevel/msg148912.html).
>>>
>>> I think we definitely need nofs as well for dealing with cases where the backend
>>> storage for the user daemon is a file.
>>>
>>> I will give this patch a try as soon as possible (I am traveling currently).
>>>
>>> Best regards.
>> I had issues with this as well, and work on this is appreciated! In my
>> case it is a loop block device on a fuse file system.
>> Setting PF_LESS_THROTTLE was the one that helped the most, though, so
>> add an option for that as well? I set this via prctl() for the thread
>> calling it (was easiest to add to).
>>
>> Sorry, I have no idea about the current rationale, but wouldn't it be
>> better to have a way to mask a set of block devices/file systems not to
>> write-back to in a thread. So in my case I'd specify that the fuse
>> daemon threads cannot write-back to the file system and loop device
>> running on top of the fuse file system, while all other block
>> devices/file systems can be write-back to (causing less swapping/OOM
>> issues).
> I'm not sure I understood you.
>
> The storage daemons I mentioned normally kick off N threads per M
> devices. The threads handle duties like IO and error handling for those
> devices. Those threads would set the flag, so those IO/error-handler
> related operations do not end up writing back to them. So it works
> similar to how storage drivers work in the kernel where iscsi_tcp has an
> xmit thread and that does memalloc_noreclaim_save. Only the threads for
> those specific devices being would set the flag.
>
> In your case, it sounds like you have a thread/threads that would
> operate on multiple devices and some need the behavior and some do not.
> Is that right?

No, sounds the same as your case. As an example think of vdfuse (or
qemu-nbd locally). You'd have something like

ext4(a) <- loop <- fuse file system <- vdfuse <- disk.vdi container file
<- ext4(b) <- block device

If vdfuse threads cause writeback to ext4(a), you'd get the issue we
have. Setting PF_LESS_THROTTLE and/or PF_MEMALLOC_NOIO mostly avoids
this problem, but with only PF_LESS_THROTTLE there are still corner
cases (I think if ext4(b) slows down suddenly) where it wedges itself
and the side effect of setting PF_MEMALLOC_NOIO are being discussed...
The best solution would be, I guess, to have a way for vdfuse to set
something, such that write-back to ext4(a) isn't allowed from those
threads, but write-back to ext4(b) (and all other block devices) is. But
I only have a rough idea of how write-back works, so this is really only
a guess.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH] Add proc interface to set PF_MEMALLOC flags
       [not found]     ` <0102016d1f7af966-334f093b-2a62-4baa-9678-8d90d5fba6d9-000000@eu-west-1.amazonses.com>
@ 2019-09-11 16:56       ` Mike Christie
  2019-09-11 19:21         ` Martin Raiber
  0 siblings, 1 reply; 20+ messages in thread
From: Mike Christie @ 2019-09-11 16:56 UTC (permalink / raw)
  To: Martin Raiber, linux-kernel, linux-scsi, linux-block, Linux-MM

On 09/11/2019 03:40 AM, Martin Raiber wrote:
> On 10.09.2019 10:35 Damien Le Moal wrote:
>> Mike,
>>
>> On 2019/09/09 19:26, Mike Christie wrote:
>>> Forgot to cc linux-mm.
>>>
>>> On 09/09/2019 11:28 AM, Mike Christie wrote:
>>>> There are several storage drivers like dm-multipath, iscsi, and nbd that
>>>> have userspace components that can run in the IO path. For example,
>>>> iscsi and nbd's userspace deamons may need to recreate a socket and/or
>>>> send IO on it, and dm-multipath's daemon multipathd may need to send IO
>>>> to figure out the state of paths and re-set them up.
>>>>
>>>> In the kernel these drivers have access to GFP_NOIO/GFP_NOFS and the
>>>> memalloc_*_save/restore functions to control the allocation behavior,
>>>> but for userspace we would end up hitting a allocation that ended up
>>>> writing data back to the same device we are trying to allocate for.
>>>>
>>>> This patch allows the userspace deamon to set the PF_MEMALLOC* flags
>>>> through procfs. It currently only supports PF_MEMALLOC_NOIO, but
>>>> depending on what other drivers and userspace file systems need, for
>>>> the final version I can add the other flags for that file or do a file
>>>> per flag or just do a memalloc_noio file.
>> Awesome. That probably will be the perfect solution for the problem we hit with
>> tcmu-runner a while back (please see this thread:
>> https://www.spinics.net/lists/linux-fsdevel/msg148912.html).
>>
>> I think we definitely need nofs as well for dealing with cases where the backend
>> storage for the user daemon is a file.
>>
>> I will give this patch a try as soon as possible (I am traveling currently).
>>
>> Best regards.
> 
> I had issues with this as well, and work on this is appreciated! In my
> case it is a loop block device on a fuse file system.
> Setting PF_LESS_THROTTLE was the one that helped the most, though, so
> add an option for that as well? I set this via prctl() for the thread
> calling it (was easiest to add to).
> 
> Sorry, I have no idea about the current rationale, but wouldn't it be
> better to have a way to mask a set of block devices/file systems not to
> write-back to in a thread. So in my case I'd specify that the fuse
> daemon threads cannot write-back to the file system and loop device
> running on top of the fuse file system, while all other block
> devices/file systems can be write-back to (causing less swapping/OOM
> issues).

I'm not sure I understood you.

The storage daemons I mentioned normally kick off N threads per M
devices. The threads handle duties like IO and error handling for those
devices. Those threads would set the flag, so those IO/error-handler
related operations do not end up writing back to them. So it works
similar to how storage drivers work in the kernel where iscsi_tcp has an
xmit thread and that does memalloc_noreclaim_save. Only the threads for
those specific devices being would set the flag.

In your case, it sounds like you have a thread/threads that would
operate on multiple devices and some need the behavior and some do not.
Is that right?



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH] Add proc interface to set PF_MEMALLOC flags
  2019-09-10 22:12   ` Tetsuo Handa
  2019-09-10 23:28     ` Kirill A. Shutemov
@ 2019-09-11 15:23     ` Mike Christie
  1 sibling, 0 replies; 20+ messages in thread
From: Mike Christie @ 2019-09-11 15:23 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: axboe, James.Bottomley, martin.petersen, linux-kernel,
	linux-scsi, linux-block, Linux-MM

On 09/10/2019 05:12 PM, Tetsuo Handa wrote:
> On 2019/09/10 3:26, Mike Christie wrote:
>> Forgot to cc linux-mm.
>>
>> On 09/09/2019 11:28 AM, Mike Christie wrote:
>>> There are several storage drivers like dm-multipath, iscsi, and nbd that
>>> have userspace components that can run in the IO path. For example,
>>> iscsi and nbd's userspace deamons may need to recreate a socket and/or
>>> send IO on it, and dm-multipath's daemon multipathd may need to send IO
>>> to figure out the state of paths and re-set them up.
>>>
>>> In the kernel these drivers have access to GFP_NOIO/GFP_NOFS and the
>>> memalloc_*_save/restore functions to control the allocation behavior,
>>> but for userspace we would end up hitting a allocation that ended up
>>> writing data back to the same device we are trying to allocate for.
>>>
>>> This patch allows the userspace deamon to set the PF_MEMALLOC* flags
>>> through procfs. It currently only supports PF_MEMALLOC_NOIO, but
>>> depending on what other drivers and userspace file systems need, for
>>> the final version I can add the other flags for that file or do a file
>>> per flag or just do a memalloc_noio file.
> 
> Interesting patch. But can't we instead globally mask __GFP_NOFS / __GFP_NOIO
> than playing games with per a thread masking (which suffers from inability to
> propagate current thread's mask to other threads indirectly involved)?

If I understood you, then that had been discussed in the past:

https://www.spinics.net/lists/linux-fsdevel/msg149035.html

We only need this for specific threads which implement part of a storage
driver in userspace.

> 
>>> +static ssize_t memalloc_write(struct file *file, const char __user *buf,
>>> +			      size_t count, loff_t *ppos)
>>> +{
>>> +	struct task_struct *task;
>>> +	char buffer[5];
>>> +	int rc = count;
>>> +
>>> +	memset(buffer, 0, sizeof(buffer));
>>> +	if (count != sizeof(buffer) - 1)
>>> +		return -EINVAL;
>>> +
>>> +	if (copy_from_user(buffer, buf, count))
> 
> copy_from_user() / copy_to_user() might involve memory allocation
> via page fault which has to be done under the mask? Moreover, since
> just open()ing this file can involve memory allocation, do we forbid
> open("/proc/thread-self/memalloc") ?

I was having the daemons set the flag when they initialize.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH] Add proc interface to set PF_MEMALLOC flags
  2019-09-10  8:35   ` Damien Le Moal
@ 2019-09-11  8:43     ` Martin Raiber
       [not found]     ` <0102016d1f7af966-334f093b-2a62-4baa-9678-8d90d5fba6d9-000000@eu-west-1.amazonses.com>
  1 sibling, 0 replies; 20+ messages in thread
From: Martin Raiber @ 2019-09-11  8:43 UTC (permalink / raw)
  To: Damien Le Moal, Mike Christie, linux-kernel, linux-scsi,
	linux-block, Linux-MM

On 10.09.2019 10:35 Damien Le Moal wrote:
> Mike,
>
> On 2019/09/09 19:26, Mike Christie wrote:
>> Forgot to cc linux-mm.
>>
>> On 09/09/2019 11:28 AM, Mike Christie wrote:
>>> There are several storage drivers like dm-multipath, iscsi, and nbd that
>>> have userspace components that can run in the IO path. For example,
>>> iscsi and nbd's userspace deamons may need to recreate a socket and/or
>>> send IO on it, and dm-multipath's daemon multipathd may need to send IO
>>> to figure out the state of paths and re-set them up.
>>>
>>> In the kernel these drivers have access to GFP_NOIO/GFP_NOFS and the
>>> memalloc_*_save/restore functions to control the allocation behavior,
>>> but for userspace we would end up hitting a allocation that ended up
>>> writing data back to the same device we are trying to allocate for.
>>>
>>> This patch allows the userspace deamon to set the PF_MEMALLOC* flags
>>> through procfs. It currently only supports PF_MEMALLOC_NOIO, but
>>> depending on what other drivers and userspace file systems need, for
>>> the final version I can add the other flags for that file or do a file
>>> per flag or just do a memalloc_noio file.
> Awesome. That probably will be the perfect solution for the problem we hit with
> tcmu-runner a while back (please see this thread:
> https://www.spinics.net/lists/linux-fsdevel/msg148912.html).
>
> I think we definitely need nofs as well for dealing with cases where the backend
> storage for the user daemon is a file.
>
> I will give this patch a try as soon as possible (I am traveling currently).
>
> Best regards.
I had issues with this as well, and work on this is appreciated! In my
case it is a loop block device on a fuse file system.

Setting PF_LESS_THROTTLE was the one that helped the most, though, so
add an option for that as well? I set this via prctl() for the thread
calling it (was easiest to add to).

Sorry, I have no idea about the current rationale, but wouldn't it be
better to have a way to mask a set of block devices/file systems not to
write-back to in a thread. So in my case I'd specify that the fuse
daemon threads cannot write-back to the file system and loop device
running on top of the fuse file system, while all other block
devices/file systems can be write-back to (causing less swapping/OOM
issues).

>
>>> Signed-off-by: Mike Christie <mchristi@redhat.com>
>>> ---
>>>  Documentation/filesystems/proc.txt |  6 ++++
>>>  fs/proc/base.c                     | 53 ++++++++++++++++++++++++++++++
>>>  2 files changed, 59 insertions(+)
>>>
>>> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
>>> index 99ca040e3f90..b5456a61a013 100644
>>> --- a/Documentation/filesystems/proc.txt
>>> +++ b/Documentation/filesystems/proc.txt
>>> @@ -46,6 +46,7 @@ Table of Contents
>>>    3.10  /proc/<pid>/timerslack_ns - Task timerslack value
>>>    3.11	/proc/<pid>/patch_state - Livepatch patch operation state
>>>    3.12	/proc/<pid>/arch_status - Task architecture specific information
>>> +  3.13  /proc/<pid>/memalloc - Control task's memory reclaim behavior
>>>  
>>>    4	Configuring procfs
>>>    4.1	Mount options
>>> @@ -1980,6 +1981,11 @@ Example
>>>   $ cat /proc/6753/arch_status
>>>   AVX512_elapsed_ms:      8
>>>  
>>> +3.13 /proc/<pid>/memalloc - Control task's memory reclaim behavior
>>> +-----------------------------------------------------------------------
>>> +A value of "noio" indicates that when a task allocates memory it will not
>>> +reclaim memory that requires starting phisical IO.
>>> +
>>>  Description
>>>  -----------
>>>  
>>> diff --git a/fs/proc/base.c b/fs/proc/base.c
>>> index ebea9501afb8..c4faa3464602 100644
>>> --- a/fs/proc/base.c
>>> +++ b/fs/proc/base.c
>>> @@ -1223,6 +1223,57 @@ static const struct file_operations proc_oom_score_adj_operations = {
>>>  	.llseek		= default_llseek,
>>>  };
>>>  
>>> +static ssize_t memalloc_read(struct file *file, char __user *buf, size_t count,
>>> +			     loff_t *ppos)
>>> +{
>>> +	struct task_struct *task;
>>> +	ssize_t rc = 0;
>>> +
>>> +	task = get_proc_task(file_inode(file));
>>> +	if (!task)
>>> +		return -ESRCH;
>>> +
>>> +	if (task->flags & PF_MEMALLOC_NOIO)
>>> +		rc = simple_read_from_buffer(buf, count, ppos, "noio", 4);
>>> +	put_task_struct(task);
>>> +	return rc;
>>> +}
>>> +
>>> +static ssize_t memalloc_write(struct file *file, const char __user *buf,
>>> +			      size_t count, loff_t *ppos)
>>> +{
>>> +	struct task_struct *task;
>>> +	char buffer[5];
>>> +	int rc = count;
>>> +
>>> +	memset(buffer, 0, sizeof(buffer));
>>> +	if (count != sizeof(buffer) - 1)
>>> +		return -EINVAL;
>>> +
>>> +	if (copy_from_user(buffer, buf, count))
>>> +		return -EFAULT;
>>> +	buffer[count] = '\0';
>>> +
>>> +	task = get_proc_task(file_inode(file));
>>> +	if (!task)
>>> +		return -ESRCH;
>>> +
>>> +	if (!strcmp(buffer, "noio")) {
>>> +		task->flags |= PF_MEMALLOC_NOIO;
>>> +	} else {
>>> +		rc = -EINVAL;
>>> +	}
>>> +
>>> +	put_task_struct(task);
>>> +	return rc;
>>> +}
>>> +
>>> +static const struct file_operations proc_memalloc_operations = {
>>> +	.read		= memalloc_read,
>>> +	.write		= memalloc_write,
>>> +	.llseek		= default_llseek,
>>> +};
>>> +
>>>  #ifdef CONFIG_AUDIT
>>>  #define TMPBUFLEN 11
>>>  static ssize_t proc_loginuid_read(struct file * file, char __user * buf,
>>> @@ -3097,6 +3148,7 @@ static const struct pid_entry tgid_base_stuff[] = {
>>>  #ifdef CONFIG_PROC_PID_ARCH_STATUS
>>>  	ONE("arch_status", S_IRUGO, proc_pid_arch_status),
>>>  #endif
>>> +	REG("memalloc", S_IRUGO|S_IWUSR, proc_memalloc_operations),
>>>  };
>>>  
>>>  static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx)
>>> @@ -3487,6 +3539,7 @@ static const struct pid_entry tid_base_stuff[] = {
>>>  #ifdef CONFIG_PROC_PID_ARCH_STATUS
>>>  	ONE("arch_status", S_IRUGO, proc_pid_arch_status),
>>>  #endif
>>> +	REG("memalloc", S_IRUGO|S_IWUSR, proc_memalloc_operations),
>>>  };
>>>  
>>>  static int proc_tid_base_readdir(struct file *file, struct dir_context *ctx)
>>>
>>
>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH] Add proc interface to set PF_MEMALLOC flags
  2019-09-09 16:28 Mike Christie
  2019-09-09 18:26 ` Mike Christie
  2019-09-10 10:00 ` Kirill A. Shutemov
@ 2019-09-11  8:23 ` Bart Van Assche
  2 siblings, 0 replies; 20+ messages in thread
From: Bart Van Assche @ 2019-09-11  8:23 UTC (permalink / raw)
  To: Mike Christie, axboe, James.Bottomley, martin.petersen,
	linux-kernel, linux-scsi, linux-block

On 9/9/19 5:28 PM, Mike Christie wrote:
> There are several storage drivers like dm-multipath, iscsi, and nbd that
> have userspace components that can run in the IO path. For example,
> iscsi and nbd's userspace deamons may need to recreate a socket and/or
> send IO on it, and dm-multipath's daemon multipathd may need to send IO
> to figure out the state of paths and re-set them up.
> 
> [ ... ]

Should the linux-api mailing list be Cc-ed for a patch like this one?
See also https://www.kernel.org/doc/man-pages/linux-api-ml.html.

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH] Add proc interface to set PF_MEMALLOC flags
  2019-09-10 22:12   ` Tetsuo Handa
@ 2019-09-10 23:28     ` Kirill A. Shutemov
  2019-09-11 15:23     ` Mike Christie
  1 sibling, 0 replies; 20+ messages in thread
From: Kirill A. Shutemov @ 2019-09-10 23:28 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Mike Christie, axboe, James.Bottomley, martin.petersen,
	linux-kernel, linux-scsi, linux-block, Linux-MM

On Wed, Sep 11, 2019 at 07:12:06AM +0900, Tetsuo Handa wrote:
> >> +static ssize_t memalloc_write(struct file *file, const char __user *buf,
> >> +			      size_t count, loff_t *ppos)
> >> +{
> >> +	struct task_struct *task;
> >> +	char buffer[5];
> >> +	int rc = count;
> >> +
> >> +	memset(buffer, 0, sizeof(buffer));
> >> +	if (count != sizeof(buffer) - 1)
> >> +		return -EINVAL;
> >> +
> >> +	if (copy_from_user(buffer, buf, count))
> 
> copy_from_user() / copy_to_user() might involve memory allocation
> via page fault which has to be done under the mask? Moreover, since
> just open()ing this file can involve memory allocation, do we forbid
> open("/proc/thread-self/memalloc") ?

Not saying that I'm okay with the approach in general, but I don't think
this a problem. The application has to set allocation policy before
inserting itself into IO or FS path.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH] Add proc interface to set PF_MEMALLOC flags
  2019-09-09 18:26 ` Mike Christie
  2019-09-10  8:35   ` Damien Le Moal
@ 2019-09-10 22:12   ` Tetsuo Handa
  2019-09-10 23:28     ` Kirill A. Shutemov
  2019-09-11 15:23     ` Mike Christie
  1 sibling, 2 replies; 20+ messages in thread
From: Tetsuo Handa @ 2019-09-10 22:12 UTC (permalink / raw)
  To: Mike Christie
  Cc: axboe, James.Bottomley, martin.petersen, linux-kernel,
	linux-scsi, linux-block, Linux-MM

On 2019/09/10 3:26, Mike Christie wrote:
> Forgot to cc linux-mm.
> 
> On 09/09/2019 11:28 AM, Mike Christie wrote:
>> There are several storage drivers like dm-multipath, iscsi, and nbd that
>> have userspace components that can run in the IO path. For example,
>> iscsi and nbd's userspace deamons may need to recreate a socket and/or
>> send IO on it, and dm-multipath's daemon multipathd may need to send IO
>> to figure out the state of paths and re-set them up.
>>
>> In the kernel these drivers have access to GFP_NOIO/GFP_NOFS and the
>> memalloc_*_save/restore functions to control the allocation behavior,
>> but for userspace we would end up hitting a allocation that ended up
>> writing data back to the same device we are trying to allocate for.
>>
>> This patch allows the userspace deamon to set the PF_MEMALLOC* flags
>> through procfs. It currently only supports PF_MEMALLOC_NOIO, but
>> depending on what other drivers and userspace file systems need, for
>> the final version I can add the other flags for that file or do a file
>> per flag or just do a memalloc_noio file.

Interesting patch. But can't we instead globally mask __GFP_NOFS / __GFP_NOIO
than playing games with per a thread masking (which suffers from inability to
propagate current thread's mask to other threads indirectly involved)?

>> +static ssize_t memalloc_write(struct file *file, const char __user *buf,
>> +			      size_t count, loff_t *ppos)
>> +{
>> +	struct task_struct *task;
>> +	char buffer[5];
>> +	int rc = count;
>> +
>> +	memset(buffer, 0, sizeof(buffer));
>> +	if (count != sizeof(buffer) - 1)
>> +		return -EINVAL;
>> +
>> +	if (copy_from_user(buffer, buf, count))

copy_from_user() / copy_to_user() might involve memory allocation
via page fault which has to be done under the mask? Moreover, since
just open()ing this file can involve memory allocation, do we forbid
open("/proc/thread-self/memalloc") ?

>> +		return -EFAULT;
>> +	buffer[count] = '\0';
>> +
>> +	task = get_proc_task(file_inode(file));
>> +	if (!task)
>> +		return -ESRCH;
>> +
>> +	if (!strcmp(buffer, "noio")) {
>> +		task->flags |= PF_MEMALLOC_NOIO;
>> +	} else {
>> +		rc = -EINVAL;
>> +	}
>> +
>> +	put_task_struct(task);
>> +	return rc;
>> +}


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH] Add proc interface to set PF_MEMALLOC flags
  2019-09-10 10:00 ` Kirill A. Shutemov
  2019-09-10 12:05   ` Damien Le Moal
@ 2019-09-10 16:06   ` Mike Christie
  1 sibling, 0 replies; 20+ messages in thread
From: Mike Christie @ 2019-09-10 16:06 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: axboe, James.Bottomley, martin.petersen, linux-kernel,
	linux-scsi, linux-block, Linux-MM

On 09/10/2019 05:00 AM, Kirill A. Shutemov wrote:
> On Mon, Sep 09, 2019 at 11:28:04AM -0500, Mike Christie wrote:
>> There are several storage drivers like dm-multipath, iscsi, and nbd that
>> have userspace components that can run in the IO path. For example,
>> iscsi and nbd's userspace deamons may need to recreate a socket and/or
>> send IO on it, and dm-multipath's daemon multipathd may need to send IO
>> to figure out the state of paths and re-set them up.
>>
>> In the kernel these drivers have access to GFP_NOIO/GFP_NOFS and the
>> memalloc_*_save/restore functions to control the allocation behavior,
>> but for userspace we would end up hitting a allocation that ended up
>> writing data back to the same device we are trying to allocate for.
>>
>> This patch allows the userspace deamon to set the PF_MEMALLOC* flags
>> through procfs. It currently only supports PF_MEMALLOC_NOIO, but
>> depending on what other drivers and userspace file systems need, for
>> the final version I can add the other flags for that file or do a file
>> per flag or just do a memalloc_noio file.
>>
>> Signed-off-by: Mike Christie <mchristi@redhat.com>
>> ---
>>  Documentation/filesystems/proc.txt |  6 ++++
>>  fs/proc/base.c                     | 53 ++++++++++++++++++++++++++++++
>>  2 files changed, 59 insertions(+)
>>
>> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
>> index 99ca040e3f90..b5456a61a013 100644
>> --- a/Documentation/filesystems/proc.txt
>> +++ b/Documentation/filesystems/proc.txt
>> @@ -46,6 +46,7 @@ Table of Contents
>>    3.10  /proc/<pid>/timerslack_ns - Task timerslack value
>>    3.11	/proc/<pid>/patch_state - Livepatch patch operation state
>>    3.12	/proc/<pid>/arch_status - Task architecture specific information
>> +  3.13  /proc/<pid>/memalloc - Control task's memory reclaim behavior
>>  
>>    4	Configuring procfs
>>    4.1	Mount options
>> @@ -1980,6 +1981,11 @@ Example
>>   $ cat /proc/6753/arch_status
>>   AVX512_elapsed_ms:      8
>>  
>> +3.13 /proc/<pid>/memalloc - Control task's memory reclaim behavior
>> +-----------------------------------------------------------------------
>> +A value of "noio" indicates that when a task allocates memory it will not
>> +reclaim memory that requires starting phisical IO.
>> +
>>  Description
>>  -----------
>>  
>> diff --git a/fs/proc/base.c b/fs/proc/base.c
>> index ebea9501afb8..c4faa3464602 100644
>> --- a/fs/proc/base.c
>> +++ b/fs/proc/base.c
>> @@ -1223,6 +1223,57 @@ static const struct file_operations proc_oom_score_adj_operations = {
>>  	.llseek		= default_llseek,
>>  };
>>  
>> +static ssize_t memalloc_read(struct file *file, char __user *buf, size_t count,
>> +			     loff_t *ppos)
>> +{
>> +	struct task_struct *task;
>> +	ssize_t rc = 0;
>> +
>> +	task = get_proc_task(file_inode(file));
>> +	if (!task)
>> +		return -ESRCH;
>> +
>> +	if (task->flags & PF_MEMALLOC_NOIO)
>> +		rc = simple_read_from_buffer(buf, count, ppos, "noio", 4);
>> +	put_task_struct(task);
>> +	return rc;
>> +}
>> +
>> +static ssize_t memalloc_write(struct file *file, const char __user *buf,
>> +			      size_t count, loff_t *ppos)
>> +{
>> +	struct task_struct *task;
>> +	char buffer[5];
>> +	int rc = count;
>> +
>> +	memset(buffer, 0, sizeof(buffer));
>> +	if (count != sizeof(buffer) - 1)
>> +		return -EINVAL;
>> +
>> +	if (copy_from_user(buffer, buf, count))
>> +		return -EFAULT;
>> +	buffer[count] = '\0';
>> +
>> +	task = get_proc_task(file_inode(file));
>> +	if (!task)
>> +		return -ESRCH;
>> +
>> +	if (!strcmp(buffer, "noio")) {
>> +		task->flags |= PF_MEMALLOC_NOIO;
>> +	} else {
>> +		rc = -EINVAL;
>> +	}
> 
> Really? Without any privilege check? So any random user can tap into
> __GFP_NOIO allocations?

That was a mistake on my part. I will add it in.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH] Add proc interface to set PF_MEMALLOC flags
  2019-09-10 12:41     ` Kirill A. Shutemov
@ 2019-09-10 13:37       ` Damien Le Moal
  0 siblings, 0 replies; 20+ messages in thread
From: Damien Le Moal @ 2019-09-10 13:37 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Mike Christie, axboe, James.Bottomley, martin.petersen,
	linux-kernel, linux-scsi, linux-block, Miklos Szeredi

+ Miklos

On 2019/09/10 13:41, Kirill A. Shutemov wrote:
> On Tue, Sep 10, 2019 at 12:05:33PM +0000, Damien Le Moal wrote:
>> On 2019/09/10 11:00, Kirill A. Shutemov wrote:
>>> On Mon, Sep 09, 2019 at 11:28:04AM -0500, Mike Christie wrote:
>>>> There are several storage drivers like dm-multipath, iscsi, and nbd that
>>>> have userspace components that can run in the IO path. For example,
>>>> iscsi and nbd's userspace deamons may need to recreate a socket and/or
>>>> send IO on it, and dm-multipath's daemon multipathd may need to send IO
>>>> to figure out the state of paths and re-set them up.
>>>>
>>>> In the kernel these drivers have access to GFP_NOIO/GFP_NOFS and the
>>>> memalloc_*_save/restore functions to control the allocation behavior,
>>>> but for userspace we would end up hitting a allocation that ended up
>>>> writing data back to the same device we are trying to allocate for.
>>>>
>>>> This patch allows the userspace deamon to set the PF_MEMALLOC* flags
>>>> through procfs. It currently only supports PF_MEMALLOC_NOIO, but
>>>> depending on what other drivers and userspace file systems need, for
>>>> the final version I can add the other flags for that file or do a file
>>>> per flag or just do a memalloc_noio file.
>>>>
>>>> Signed-off-by: Mike Christie <mchristi@redhat.com>
>>>> ---
>>>>  Documentation/filesystems/proc.txt |  6 ++++
>>>>  fs/proc/base.c                     | 53 ++++++++++++++++++++++++++++++
>>>>  2 files changed, 59 insertions(+)
>>>>
>>>> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
>>>> index 99ca040e3f90..b5456a61a013 100644
>>>> --- a/Documentation/filesystems/proc.txt
>>>> +++ b/Documentation/filesystems/proc.txt
>>>> @@ -46,6 +46,7 @@ Table of Contents
>>>>    3.10  /proc/<pid>/timerslack_ns - Task timerslack value
>>>>    3.11	/proc/<pid>/patch_state - Livepatch patch operation state
>>>>    3.12	/proc/<pid>/arch_status - Task architecture specific information
>>>> +  3.13  /proc/<pid>/memalloc - Control task's memory reclaim behavior
>>>>  
>>>>    4	Configuring procfs
>>>>    4.1	Mount options
>>>> @@ -1980,6 +1981,11 @@ Example
>>>>   $ cat /proc/6753/arch_status
>>>>   AVX512_elapsed_ms:      8
>>>>  
>>>> +3.13 /proc/<pid>/memalloc - Control task's memory reclaim behavior
>>>> +-----------------------------------------------------------------------
>>>> +A value of "noio" indicates that when a task allocates memory it will not
>>>> +reclaim memory that requires starting phisical IO.
>>>> +
>>>>  Description
>>>>  -----------
>>>>  
>>>> diff --git a/fs/proc/base.c b/fs/proc/base.c
>>>> index ebea9501afb8..c4faa3464602 100644
>>>> --- a/fs/proc/base.c
>>>> +++ b/fs/proc/base.c
>>>> @@ -1223,6 +1223,57 @@ static const struct file_operations proc_oom_score_adj_operations = {
>>>>  	.llseek		= default_llseek,
>>>>  };
>>>>  
>>>> +static ssize_t memalloc_read(struct file *file, char __user *buf, size_t count,
>>>> +			     loff_t *ppos)
>>>> +{
>>>> +	struct task_struct *task;
>>>> +	ssize_t rc = 0;
>>>> +
>>>> +	task = get_proc_task(file_inode(file));
>>>> +	if (!task)
>>>> +		return -ESRCH;
>>>> +
>>>> +	if (task->flags & PF_MEMALLOC_NOIO)
>>>> +		rc = simple_read_from_buffer(buf, count, ppos, "noio", 4);
>>>> +	put_task_struct(task);
>>>> +	return rc;
>>>> +}
>>>> +
>>>> +static ssize_t memalloc_write(struct file *file, const char __user *buf,
>>>> +			      size_t count, loff_t *ppos)
>>>> +{
>>>> +	struct task_struct *task;
>>>> +	char buffer[5];
>>>> +	int rc = count;
>>>> +
>>>> +	memset(buffer, 0, sizeof(buffer));
>>>> +	if (count != sizeof(buffer) - 1)
>>>> +		return -EINVAL;
>>>> +
>>>> +	if (copy_from_user(buffer, buf, count))
>>>> +		return -EFAULT;
>>>> +	buffer[count] = '\0';
>>>> +
>>>> +	task = get_proc_task(file_inode(file));
>>>> +	if (!task)
>>>> +		return -ESRCH;
>>>> +
>>>> +	if (!strcmp(buffer, "noio")) {
>>>> +		task->flags |= PF_MEMALLOC_NOIO;
>>>> +	} else {
>>>> +		rc = -EINVAL;
>>>> +	}
>>>
>>> Really? Without any privilege check? So any random user can tap into
>>> __GFP_NOIO allocations?
>>
>> OK. It probably should have a test on capable(CAP_SYS_ADMIN) or similar. Since
>> these storage daemons are generally run as root anyway, that would still work
>> for most setup I think.
>>
>>>
>>> NAK.
>>>
>>> I don't think that it's great idea in general to expose this low-level
>>> machinery to userspace. But it's better to get comment from people move
>>> familiar with reclaim path.
>>
>> Any setup with stacked file systems and one of the IO path component being a
>> user level process can benefit from this. See the problem described in this
>> patch I pushed for (unsuccessfully as it was a heavy handed solution):
>> https://www.spinics.net/lists/linux-fsdevel/msg148912.html
>>
>> As the discussion in this thread shows, there is no existing simple solution to
>> deal with this reclaim recursion problem. And automatic detection is too hard,
>> if at all possible. With the proper access rights added, this user accessible
>> interface does look very sensible to me.
> 
> Looking into the thread, have you find out if there's anything on FUSE
> side that helps it to avoid deadlocks? Or FUSE just relies on luck with
> this?

I did not see anything relevant. The nofs allocations seem to all be in the
writpage/writepages methods for the client side, to prepare requests to send to
the fuse daemon serving them. I think that that is equivalent to a regular FS
(e.g. XFS) using NOFS allocations during writeback on top of the emulated device
served by a user level daemon (e.g. tcmu-runner in the problem case I reported).
So it does look like a fuse daemon actually serving the request may still
trigger a reclaim into the fuse FS. I wonder if such problem ever was reported
or if there are some clever tricks I am missing.

Miklos,

Could you comment on this ? Is there a mechanism in fuse preventing the
userspace fuse daemon memory triggering a reclaim into the fuse FS being processed ?

Best regards.

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH] Add proc interface to set PF_MEMALLOC flags
  2019-09-10 12:05   ` Damien Le Moal
@ 2019-09-10 12:41     ` Kirill A. Shutemov
  2019-09-10 13:37       ` Damien Le Moal
  0 siblings, 1 reply; 20+ messages in thread
From: Kirill A. Shutemov @ 2019-09-10 12:41 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Mike Christie, axboe, James.Bottomley, martin.petersen,
	linux-kernel, linux-scsi, linux-block

On Tue, Sep 10, 2019 at 12:05:33PM +0000, Damien Le Moal wrote:
> On 2019/09/10 11:00, Kirill A. Shutemov wrote:
> > On Mon, Sep 09, 2019 at 11:28:04AM -0500, Mike Christie wrote:
> >> There are several storage drivers like dm-multipath, iscsi, and nbd that
> >> have userspace components that can run in the IO path. For example,
> >> iscsi and nbd's userspace deamons may need to recreate a socket and/or
> >> send IO on it, and dm-multipath's daemon multipathd may need to send IO
> >> to figure out the state of paths and re-set them up.
> >>
> >> In the kernel these drivers have access to GFP_NOIO/GFP_NOFS and the
> >> memalloc_*_save/restore functions to control the allocation behavior,
> >> but for userspace we would end up hitting a allocation that ended up
> >> writing data back to the same device we are trying to allocate for.
> >>
> >> This patch allows the userspace deamon to set the PF_MEMALLOC* flags
> >> through procfs. It currently only supports PF_MEMALLOC_NOIO, but
> >> depending on what other drivers and userspace file systems need, for
> >> the final version I can add the other flags for that file or do a file
> >> per flag or just do a memalloc_noio file.
> >>
> >> Signed-off-by: Mike Christie <mchristi@redhat.com>
> >> ---
> >>  Documentation/filesystems/proc.txt |  6 ++++
> >>  fs/proc/base.c                     | 53 ++++++++++++++++++++++++++++++
> >>  2 files changed, 59 insertions(+)
> >>
> >> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> >> index 99ca040e3f90..b5456a61a013 100644
> >> --- a/Documentation/filesystems/proc.txt
> >> +++ b/Documentation/filesystems/proc.txt
> >> @@ -46,6 +46,7 @@ Table of Contents
> >>    3.10  /proc/<pid>/timerslack_ns - Task timerslack value
> >>    3.11	/proc/<pid>/patch_state - Livepatch patch operation state
> >>    3.12	/proc/<pid>/arch_status - Task architecture specific information
> >> +  3.13  /proc/<pid>/memalloc - Control task's memory reclaim behavior
> >>  
> >>    4	Configuring procfs
> >>    4.1	Mount options
> >> @@ -1980,6 +1981,11 @@ Example
> >>   $ cat /proc/6753/arch_status
> >>   AVX512_elapsed_ms:      8
> >>  
> >> +3.13 /proc/<pid>/memalloc - Control task's memory reclaim behavior
> >> +-----------------------------------------------------------------------
> >> +A value of "noio" indicates that when a task allocates memory it will not
> >> +reclaim memory that requires starting phisical IO.
> >> +
> >>  Description
> >>  -----------
> >>  
> >> diff --git a/fs/proc/base.c b/fs/proc/base.c
> >> index ebea9501afb8..c4faa3464602 100644
> >> --- a/fs/proc/base.c
> >> +++ b/fs/proc/base.c
> >> @@ -1223,6 +1223,57 @@ static const struct file_operations proc_oom_score_adj_operations = {
> >>  	.llseek		= default_llseek,
> >>  };
> >>  
> >> +static ssize_t memalloc_read(struct file *file, char __user *buf, size_t count,
> >> +			     loff_t *ppos)
> >> +{
> >> +	struct task_struct *task;
> >> +	ssize_t rc = 0;
> >> +
> >> +	task = get_proc_task(file_inode(file));
> >> +	if (!task)
> >> +		return -ESRCH;
> >> +
> >> +	if (task->flags & PF_MEMALLOC_NOIO)
> >> +		rc = simple_read_from_buffer(buf, count, ppos, "noio", 4);
> >> +	put_task_struct(task);
> >> +	return rc;
> >> +}
> >> +
> >> +static ssize_t memalloc_write(struct file *file, const char __user *buf,
> >> +			      size_t count, loff_t *ppos)
> >> +{
> >> +	struct task_struct *task;
> >> +	char buffer[5];
> >> +	int rc = count;
> >> +
> >> +	memset(buffer, 0, sizeof(buffer));
> >> +	if (count != sizeof(buffer) - 1)
> >> +		return -EINVAL;
> >> +
> >> +	if (copy_from_user(buffer, buf, count))
> >> +		return -EFAULT;
> >> +	buffer[count] = '\0';
> >> +
> >> +	task = get_proc_task(file_inode(file));
> >> +	if (!task)
> >> +		return -ESRCH;
> >> +
> >> +	if (!strcmp(buffer, "noio")) {
> >> +		task->flags |= PF_MEMALLOC_NOIO;
> >> +	} else {
> >> +		rc = -EINVAL;
> >> +	}
> > 
> > Really? Without any privilege check? So any random user can tap into
> > __GFP_NOIO allocations?
> 
> OK. It probably should have a test on capable(CAP_SYS_ADMIN) or similar. Since
> these storage daemons are generally run as root anyway, that would still work
> for most setup I think.
> 
> > 
> > NAK.
> > 
> > I don't think that it's great idea in general to expose this low-level
> > machinery to userspace. But it's better to get comment from people move
> > familiar with reclaim path.
> 
> Any setup with stacked file systems and one of the IO path component being a
> user level process can benefit from this. See the problem described in this
> patch I pushed for (unsuccessfully as it was a heavy handed solution):
> https://www.spinics.net/lists/linux-fsdevel/msg148912.html
> 
> As the discussion in this thread shows, there is no existing simple solution to
> deal with this reclaim recursion problem. And automatic detection is too hard,
> if at all possible. With the proper access rights added, this user accessible
> interface does look very sensible to me.

Looking into the thread, have you find out if there's anything on FUSE
side that helps it to avoid deadlocks? Or FUSE just relies on luck with
this?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH] Add proc interface to set PF_MEMALLOC flags
  2019-09-10 10:00 ` Kirill A. Shutemov
@ 2019-09-10 12:05   ` Damien Le Moal
  2019-09-10 12:41     ` Kirill A. Shutemov
  2019-09-10 16:06   ` Mike Christie
  1 sibling, 1 reply; 20+ messages in thread
From: Damien Le Moal @ 2019-09-10 12:05 UTC (permalink / raw)
  To: Kirill A. Shutemov, Mike Christie
  Cc: axboe, James.Bottomley, martin.petersen, linux-kernel,
	linux-scsi, linux-block

On 2019/09/10 11:00, Kirill A. Shutemov wrote:
> On Mon, Sep 09, 2019 at 11:28:04AM -0500, Mike Christie wrote:
>> There are several storage drivers like dm-multipath, iscsi, and nbd that
>> have userspace components that can run in the IO path. For example,
>> iscsi and nbd's userspace deamons may need to recreate a socket and/or
>> send IO on it, and dm-multipath's daemon multipathd may need to send IO
>> to figure out the state of paths and re-set them up.
>>
>> In the kernel these drivers have access to GFP_NOIO/GFP_NOFS and the
>> memalloc_*_save/restore functions to control the allocation behavior,
>> but for userspace we would end up hitting a allocation that ended up
>> writing data back to the same device we are trying to allocate for.
>>
>> This patch allows the userspace deamon to set the PF_MEMALLOC* flags
>> through procfs. It currently only supports PF_MEMALLOC_NOIO, but
>> depending on what other drivers and userspace file systems need, for
>> the final version I can add the other flags for that file or do a file
>> per flag or just do a memalloc_noio file.
>>
>> Signed-off-by: Mike Christie <mchristi@redhat.com>
>> ---
>>  Documentation/filesystems/proc.txt |  6 ++++
>>  fs/proc/base.c                     | 53 ++++++++++++++++++++++++++++++
>>  2 files changed, 59 insertions(+)
>>
>> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
>> index 99ca040e3f90..b5456a61a013 100644
>> --- a/Documentation/filesystems/proc.txt
>> +++ b/Documentation/filesystems/proc.txt
>> @@ -46,6 +46,7 @@ Table of Contents
>>    3.10  /proc/<pid>/timerslack_ns - Task timerslack value
>>    3.11	/proc/<pid>/patch_state - Livepatch patch operation state
>>    3.12	/proc/<pid>/arch_status - Task architecture specific information
>> +  3.13  /proc/<pid>/memalloc - Control task's memory reclaim behavior
>>  
>>    4	Configuring procfs
>>    4.1	Mount options
>> @@ -1980,6 +1981,11 @@ Example
>>   $ cat /proc/6753/arch_status
>>   AVX512_elapsed_ms:      8
>>  
>> +3.13 /proc/<pid>/memalloc - Control task's memory reclaim behavior
>> +-----------------------------------------------------------------------
>> +A value of "noio" indicates that when a task allocates memory it will not
>> +reclaim memory that requires starting phisical IO.
>> +
>>  Description
>>  -----------
>>  
>> diff --git a/fs/proc/base.c b/fs/proc/base.c
>> index ebea9501afb8..c4faa3464602 100644
>> --- a/fs/proc/base.c
>> +++ b/fs/proc/base.c
>> @@ -1223,6 +1223,57 @@ static const struct file_operations proc_oom_score_adj_operations = {
>>  	.llseek		= default_llseek,
>>  };
>>  
>> +static ssize_t memalloc_read(struct file *file, char __user *buf, size_t count,
>> +			     loff_t *ppos)
>> +{
>> +	struct task_struct *task;
>> +	ssize_t rc = 0;
>> +
>> +	task = get_proc_task(file_inode(file));
>> +	if (!task)
>> +		return -ESRCH;
>> +
>> +	if (task->flags & PF_MEMALLOC_NOIO)
>> +		rc = simple_read_from_buffer(buf, count, ppos, "noio", 4);
>> +	put_task_struct(task);
>> +	return rc;
>> +}
>> +
>> +static ssize_t memalloc_write(struct file *file, const char __user *buf,
>> +			      size_t count, loff_t *ppos)
>> +{
>> +	struct task_struct *task;
>> +	char buffer[5];
>> +	int rc = count;
>> +
>> +	memset(buffer, 0, sizeof(buffer));
>> +	if (count != sizeof(buffer) - 1)
>> +		return -EINVAL;
>> +
>> +	if (copy_from_user(buffer, buf, count))
>> +		return -EFAULT;
>> +	buffer[count] = '\0';
>> +
>> +	task = get_proc_task(file_inode(file));
>> +	if (!task)
>> +		return -ESRCH;
>> +
>> +	if (!strcmp(buffer, "noio")) {
>> +		task->flags |= PF_MEMALLOC_NOIO;
>> +	} else {
>> +		rc = -EINVAL;
>> +	}
> 
> Really? Without any privilege check? So any random user can tap into
> __GFP_NOIO allocations?

OK. It probably should have a test on capable(CAP_SYS_ADMIN) or similar. Since
these storage daemons are generally run as root anyway, that would still work
for most setup I think.

> 
> NAK.
> 
> I don't think that it's great idea in general to expose this low-level
> machinery to userspace. But it's better to get comment from people move
> familiar with reclaim path.

Any setup with stacked file systems and one of the IO path component being a
user level process can benefit from this. See the problem described in this
patch I pushed for (unsuccessfully as it was a heavy handed solution):
https://www.spinics.net/lists/linux-fsdevel/msg148912.html

As the discussion in this thread shows, there is no existing simple solution to
deal with this reclaim recursion problem. And automatic detection is too hard,
if at all possible. With the proper access rights added, this user accessible
interface does look very sensible to me.

Best regards.

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH] Add proc interface to set PF_MEMALLOC flags
  2019-09-09 16:28 Mike Christie
  2019-09-09 18:26 ` Mike Christie
@ 2019-09-10 10:00 ` Kirill A. Shutemov
  2019-09-10 12:05   ` Damien Le Moal
  2019-09-10 16:06   ` Mike Christie
  2019-09-11  8:23 ` Bart Van Assche
  2 siblings, 2 replies; 20+ messages in thread
From: Kirill A. Shutemov @ 2019-09-10 10:00 UTC (permalink / raw)
  To: Mike Christie
  Cc: axboe, James.Bottomley, martin.petersen, linux-kernel,
	linux-scsi, linux-block

On Mon, Sep 09, 2019 at 11:28:04AM -0500, Mike Christie wrote:
> There are several storage drivers like dm-multipath, iscsi, and nbd that
> have userspace components that can run in the IO path. For example,
> iscsi and nbd's userspace deamons may need to recreate a socket and/or
> send IO on it, and dm-multipath's daemon multipathd may need to send IO
> to figure out the state of paths and re-set them up.
> 
> In the kernel these drivers have access to GFP_NOIO/GFP_NOFS and the
> memalloc_*_save/restore functions to control the allocation behavior,
> but for userspace we would end up hitting a allocation that ended up
> writing data back to the same device we are trying to allocate for.
> 
> This patch allows the userspace deamon to set the PF_MEMALLOC* flags
> through procfs. It currently only supports PF_MEMALLOC_NOIO, but
> depending on what other drivers and userspace file systems need, for
> the final version I can add the other flags for that file or do a file
> per flag or just do a memalloc_noio file.
> 
> Signed-off-by: Mike Christie <mchristi@redhat.com>
> ---
>  Documentation/filesystems/proc.txt |  6 ++++
>  fs/proc/base.c                     | 53 ++++++++++++++++++++++++++++++
>  2 files changed, 59 insertions(+)
> 
> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> index 99ca040e3f90..b5456a61a013 100644
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -46,6 +46,7 @@ Table of Contents
>    3.10  /proc/<pid>/timerslack_ns - Task timerslack value
>    3.11	/proc/<pid>/patch_state - Livepatch patch operation state
>    3.12	/proc/<pid>/arch_status - Task architecture specific information
> +  3.13  /proc/<pid>/memalloc - Control task's memory reclaim behavior
>  
>    4	Configuring procfs
>    4.1	Mount options
> @@ -1980,6 +1981,11 @@ Example
>   $ cat /proc/6753/arch_status
>   AVX512_elapsed_ms:      8
>  
> +3.13 /proc/<pid>/memalloc - Control task's memory reclaim behavior
> +-----------------------------------------------------------------------
> +A value of "noio" indicates that when a task allocates memory it will not
> +reclaim memory that requires starting phisical IO.
> +
>  Description
>  -----------
>  
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index ebea9501afb8..c4faa3464602 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -1223,6 +1223,57 @@ static const struct file_operations proc_oom_score_adj_operations = {
>  	.llseek		= default_llseek,
>  };
>  
> +static ssize_t memalloc_read(struct file *file, char __user *buf, size_t count,
> +			     loff_t *ppos)
> +{
> +	struct task_struct *task;
> +	ssize_t rc = 0;
> +
> +	task = get_proc_task(file_inode(file));
> +	if (!task)
> +		return -ESRCH;
> +
> +	if (task->flags & PF_MEMALLOC_NOIO)
> +		rc = simple_read_from_buffer(buf, count, ppos, "noio", 4);
> +	put_task_struct(task);
> +	return rc;
> +}
> +
> +static ssize_t memalloc_write(struct file *file, const char __user *buf,
> +			      size_t count, loff_t *ppos)
> +{
> +	struct task_struct *task;
> +	char buffer[5];
> +	int rc = count;
> +
> +	memset(buffer, 0, sizeof(buffer));
> +	if (count != sizeof(buffer) - 1)
> +		return -EINVAL;
> +
> +	if (copy_from_user(buffer, buf, count))
> +		return -EFAULT;
> +	buffer[count] = '\0';
> +
> +	task = get_proc_task(file_inode(file));
> +	if (!task)
> +		return -ESRCH;
> +
> +	if (!strcmp(buffer, "noio")) {
> +		task->flags |= PF_MEMALLOC_NOIO;
> +	} else {
> +		rc = -EINVAL;
> +	}

Really? Without any privilege check? So any random user can tap into
__GFP_NOIO allocations?

NAK.

I don't think that it's great idea in general to expose this low-level
machinery to userspace. But it's better to get comment from people move
familiar with reclaim path.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH] Add proc interface to set PF_MEMALLOC flags
  2019-09-09 18:26 ` Mike Christie
@ 2019-09-10  8:35   ` Damien Le Moal
  2019-09-11  8:43     ` Martin Raiber
       [not found]     ` <0102016d1f7af966-334f093b-2a62-4baa-9678-8d90d5fba6d9-000000@eu-west-1.amazonses.com>
  2019-09-10 22:12   ` Tetsuo Handa
  1 sibling, 2 replies; 20+ messages in thread
From: Damien Le Moal @ 2019-09-10  8:35 UTC (permalink / raw)
  To: Mike Christie, axboe, James.Bottomley, martin.petersen,
	linux-kernel, linux-scsi, linux-block, Linux-MM

Mike,

On 2019/09/09 19:26, Mike Christie wrote:
> Forgot to cc linux-mm.
> 
> On 09/09/2019 11:28 AM, Mike Christie wrote:
>> There are several storage drivers like dm-multipath, iscsi, and nbd that
>> have userspace components that can run in the IO path. For example,
>> iscsi and nbd's userspace deamons may need to recreate a socket and/or
>> send IO on it, and dm-multipath's daemon multipathd may need to send IO
>> to figure out the state of paths and re-set them up.
>>
>> In the kernel these drivers have access to GFP_NOIO/GFP_NOFS and the
>> memalloc_*_save/restore functions to control the allocation behavior,
>> but for userspace we would end up hitting a allocation that ended up
>> writing data back to the same device we are trying to allocate for.
>>
>> This patch allows the userspace deamon to set the PF_MEMALLOC* flags
>> through procfs. It currently only supports PF_MEMALLOC_NOIO, but
>> depending on what other drivers and userspace file systems need, for
>> the final version I can add the other flags for that file or do a file
>> per flag or just do a memalloc_noio file.

Awesome. That probably will be the perfect solution for the problem we hit with
tcmu-runner a while back (please see this thread:
https://www.spinics.net/lists/linux-fsdevel/msg148912.html).

I think we definitely need nofs as well for dealing with cases where the backend
storage for the user daemon is a file.

I will give this patch a try as soon as possible (I am traveling currently).

Best regards.

>>
>> Signed-off-by: Mike Christie <mchristi@redhat.com>
>> ---
>>  Documentation/filesystems/proc.txt |  6 ++++
>>  fs/proc/base.c                     | 53 ++++++++++++++++++++++++++++++
>>  2 files changed, 59 insertions(+)
>>
>> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
>> index 99ca040e3f90..b5456a61a013 100644
>> --- a/Documentation/filesystems/proc.txt
>> +++ b/Documentation/filesystems/proc.txt
>> @@ -46,6 +46,7 @@ Table of Contents
>>    3.10  /proc/<pid>/timerslack_ns - Task timerslack value
>>    3.11	/proc/<pid>/patch_state - Livepatch patch operation state
>>    3.12	/proc/<pid>/arch_status - Task architecture specific information
>> +  3.13  /proc/<pid>/memalloc - Control task's memory reclaim behavior
>>  
>>    4	Configuring procfs
>>    4.1	Mount options
>> @@ -1980,6 +1981,11 @@ Example
>>   $ cat /proc/6753/arch_status
>>   AVX512_elapsed_ms:      8
>>  
>> +3.13 /proc/<pid>/memalloc - Control task's memory reclaim behavior
>> +-----------------------------------------------------------------------
>> +A value of "noio" indicates that when a task allocates memory it will not
>> +reclaim memory that requires starting phisical IO.
>> +
>>  Description
>>  -----------
>>  
>> diff --git a/fs/proc/base.c b/fs/proc/base.c
>> index ebea9501afb8..c4faa3464602 100644
>> --- a/fs/proc/base.c
>> +++ b/fs/proc/base.c
>> @@ -1223,6 +1223,57 @@ static const struct file_operations proc_oom_score_adj_operations = {
>>  	.llseek		= default_llseek,
>>  };
>>  
>> +static ssize_t memalloc_read(struct file *file, char __user *buf, size_t count,
>> +			     loff_t *ppos)
>> +{
>> +	struct task_struct *task;
>> +	ssize_t rc = 0;
>> +
>> +	task = get_proc_task(file_inode(file));
>> +	if (!task)
>> +		return -ESRCH;
>> +
>> +	if (task->flags & PF_MEMALLOC_NOIO)
>> +		rc = simple_read_from_buffer(buf, count, ppos, "noio", 4);
>> +	put_task_struct(task);
>> +	return rc;
>> +}
>> +
>> +static ssize_t memalloc_write(struct file *file, const char __user *buf,
>> +			      size_t count, loff_t *ppos)
>> +{
>> +	struct task_struct *task;
>> +	char buffer[5];
>> +	int rc = count;
>> +
>> +	memset(buffer, 0, sizeof(buffer));
>> +	if (count != sizeof(buffer) - 1)
>> +		return -EINVAL;
>> +
>> +	if (copy_from_user(buffer, buf, count))
>> +		return -EFAULT;
>> +	buffer[count] = '\0';
>> +
>> +	task = get_proc_task(file_inode(file));
>> +	if (!task)
>> +		return -ESRCH;
>> +
>> +	if (!strcmp(buffer, "noio")) {
>> +		task->flags |= PF_MEMALLOC_NOIO;
>> +	} else {
>> +		rc = -EINVAL;
>> +	}
>> +
>> +	put_task_struct(task);
>> +	return rc;
>> +}
>> +
>> +static const struct file_operations proc_memalloc_operations = {
>> +	.read		= memalloc_read,
>> +	.write		= memalloc_write,
>> +	.llseek		= default_llseek,
>> +};
>> +
>>  #ifdef CONFIG_AUDIT
>>  #define TMPBUFLEN 11
>>  static ssize_t proc_loginuid_read(struct file * file, char __user * buf,
>> @@ -3097,6 +3148,7 @@ static const struct pid_entry tgid_base_stuff[] = {
>>  #ifdef CONFIG_PROC_PID_ARCH_STATUS
>>  	ONE("arch_status", S_IRUGO, proc_pid_arch_status),
>>  #endif
>> +	REG("memalloc", S_IRUGO|S_IWUSR, proc_memalloc_operations),
>>  };
>>  
>>  static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx)
>> @@ -3487,6 +3539,7 @@ static const struct pid_entry tid_base_stuff[] = {
>>  #ifdef CONFIG_PROC_PID_ARCH_STATUS
>>  	ONE("arch_status", S_IRUGO, proc_pid_arch_status),
>>  #endif
>> +	REG("memalloc", S_IRUGO|S_IWUSR, proc_memalloc_operations),
>>  };
>>  
>>  static int proc_tid_base_readdir(struct file *file, struct dir_context *ctx)
>>
> 
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH] Add proc interface to set PF_MEMALLOC flags
  2019-09-09 16:28 Mike Christie
@ 2019-09-09 18:26 ` Mike Christie
  2019-09-10  8:35   ` Damien Le Moal
  2019-09-10 22:12   ` Tetsuo Handa
  2019-09-10 10:00 ` Kirill A. Shutemov
  2019-09-11  8:23 ` Bart Van Assche
  2 siblings, 2 replies; 20+ messages in thread
From: Mike Christie @ 2019-09-09 18:26 UTC (permalink / raw)
  To: axboe, James.Bottomley, martin.petersen, linux-kernel,
	linux-scsi, linux-block, Linux-MM

Forgot to cc linux-mm.

On 09/09/2019 11:28 AM, Mike Christie wrote:
> There are several storage drivers like dm-multipath, iscsi, and nbd that
> have userspace components that can run in the IO path. For example,
> iscsi and nbd's userspace deamons may need to recreate a socket and/or
> send IO on it, and dm-multipath's daemon multipathd may need to send IO
> to figure out the state of paths and re-set them up.
> 
> In the kernel these drivers have access to GFP_NOIO/GFP_NOFS and the
> memalloc_*_save/restore functions to control the allocation behavior,
> but for userspace we would end up hitting a allocation that ended up
> writing data back to the same device we are trying to allocate for.
> 
> This patch allows the userspace deamon to set the PF_MEMALLOC* flags
> through procfs. It currently only supports PF_MEMALLOC_NOIO, but
> depending on what other drivers and userspace file systems need, for
> the final version I can add the other flags for that file or do a file
> per flag or just do a memalloc_noio file.
> 
> Signed-off-by: Mike Christie <mchristi@redhat.com>
> ---
>  Documentation/filesystems/proc.txt |  6 ++++
>  fs/proc/base.c                     | 53 ++++++++++++++++++++++++++++++
>  2 files changed, 59 insertions(+)
> 
> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> index 99ca040e3f90..b5456a61a013 100644
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -46,6 +46,7 @@ Table of Contents
>    3.10  /proc/<pid>/timerslack_ns - Task timerslack value
>    3.11	/proc/<pid>/patch_state - Livepatch patch operation state
>    3.12	/proc/<pid>/arch_status - Task architecture specific information
> +  3.13  /proc/<pid>/memalloc - Control task's memory reclaim behavior
>  
>    4	Configuring procfs
>    4.1	Mount options
> @@ -1980,6 +1981,11 @@ Example
>   $ cat /proc/6753/arch_status
>   AVX512_elapsed_ms:      8
>  
> +3.13 /proc/<pid>/memalloc - Control task's memory reclaim behavior
> +-----------------------------------------------------------------------
> +A value of "noio" indicates that when a task allocates memory it will not
> +reclaim memory that requires starting phisical IO.
> +
>  Description
>  -----------
>  
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index ebea9501afb8..c4faa3464602 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -1223,6 +1223,57 @@ static const struct file_operations proc_oom_score_adj_operations = {
>  	.llseek		= default_llseek,
>  };
>  
> +static ssize_t memalloc_read(struct file *file, char __user *buf, size_t count,
> +			     loff_t *ppos)
> +{
> +	struct task_struct *task;
> +	ssize_t rc = 0;
> +
> +	task = get_proc_task(file_inode(file));
> +	if (!task)
> +		return -ESRCH;
> +
> +	if (task->flags & PF_MEMALLOC_NOIO)
> +		rc = simple_read_from_buffer(buf, count, ppos, "noio", 4);
> +	put_task_struct(task);
> +	return rc;
> +}
> +
> +static ssize_t memalloc_write(struct file *file, const char __user *buf,
> +			      size_t count, loff_t *ppos)
> +{
> +	struct task_struct *task;
> +	char buffer[5];
> +	int rc = count;
> +
> +	memset(buffer, 0, sizeof(buffer));
> +	if (count != sizeof(buffer) - 1)
> +		return -EINVAL;
> +
> +	if (copy_from_user(buffer, buf, count))
> +		return -EFAULT;
> +	buffer[count] = '\0';
> +
> +	task = get_proc_task(file_inode(file));
> +	if (!task)
> +		return -ESRCH;
> +
> +	if (!strcmp(buffer, "noio")) {
> +		task->flags |= PF_MEMALLOC_NOIO;
> +	} else {
> +		rc = -EINVAL;
> +	}
> +
> +	put_task_struct(task);
> +	return rc;
> +}
> +
> +static const struct file_operations proc_memalloc_operations = {
> +	.read		= memalloc_read,
> +	.write		= memalloc_write,
> +	.llseek		= default_llseek,
> +};
> +
>  #ifdef CONFIG_AUDIT
>  #define TMPBUFLEN 11
>  static ssize_t proc_loginuid_read(struct file * file, char __user * buf,
> @@ -3097,6 +3148,7 @@ static const struct pid_entry tgid_base_stuff[] = {
>  #ifdef CONFIG_PROC_PID_ARCH_STATUS
>  	ONE("arch_status", S_IRUGO, proc_pid_arch_status),
>  #endif
> +	REG("memalloc", S_IRUGO|S_IWUSR, proc_memalloc_operations),
>  };
>  
>  static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx)
> @@ -3487,6 +3539,7 @@ static const struct pid_entry tid_base_stuff[] = {
>  #ifdef CONFIG_PROC_PID_ARCH_STATUS
>  	ONE("arch_status", S_IRUGO, proc_pid_arch_status),
>  #endif
> +	REG("memalloc", S_IRUGO|S_IWUSR, proc_memalloc_operations),
>  };
>  
>  static int proc_tid_base_readdir(struct file *file, struct dir_context *ctx)
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [RFC PATCH] Add proc interface to set PF_MEMALLOC flags
@ 2019-09-09 16:28 Mike Christie
  2019-09-09 18:26 ` Mike Christie
                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Mike Christie @ 2019-09-09 16:28 UTC (permalink / raw)
  To: axboe, James.Bottomley, martin.petersen, linux-kernel,
	linux-scsi, linux-block
  Cc: Mike Christie

There are several storage drivers like dm-multipath, iscsi, and nbd that
have userspace components that can run in the IO path. For example,
iscsi and nbd's userspace deamons may need to recreate a socket and/or
send IO on it, and dm-multipath's daemon multipathd may need to send IO
to figure out the state of paths and re-set them up.

In the kernel these drivers have access to GFP_NOIO/GFP_NOFS and the
memalloc_*_save/restore functions to control the allocation behavior,
but for userspace we would end up hitting a allocation that ended up
writing data back to the same device we are trying to allocate for.

This patch allows the userspace deamon to set the PF_MEMALLOC* flags
through procfs. It currently only supports PF_MEMALLOC_NOIO, but
depending on what other drivers and userspace file systems need, for
the final version I can add the other flags for that file or do a file
per flag or just do a memalloc_noio file.

Signed-off-by: Mike Christie <mchristi@redhat.com>
---
 Documentation/filesystems/proc.txt |  6 ++++
 fs/proc/base.c                     | 53 ++++++++++++++++++++++++++++++
 2 files changed, 59 insertions(+)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 99ca040e3f90..b5456a61a013 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -46,6 +46,7 @@ Table of Contents
   3.10  /proc/<pid>/timerslack_ns - Task timerslack value
   3.11	/proc/<pid>/patch_state - Livepatch patch operation state
   3.12	/proc/<pid>/arch_status - Task architecture specific information
+  3.13  /proc/<pid>/memalloc - Control task's memory reclaim behavior
 
   4	Configuring procfs
   4.1	Mount options
@@ -1980,6 +1981,11 @@ Example
  $ cat /proc/6753/arch_status
  AVX512_elapsed_ms:      8
 
+3.13 /proc/<pid>/memalloc - Control task's memory reclaim behavior
+-----------------------------------------------------------------------
+A value of "noio" indicates that when a task allocates memory it will not
+reclaim memory that requires starting phisical IO.
+
 Description
 -----------
 
diff --git a/fs/proc/base.c b/fs/proc/base.c
index ebea9501afb8..c4faa3464602 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1223,6 +1223,57 @@ static const struct file_operations proc_oom_score_adj_operations = {
 	.llseek		= default_llseek,
 };
 
+static ssize_t memalloc_read(struct file *file, char __user *buf, size_t count,
+			     loff_t *ppos)
+{
+	struct task_struct *task;
+	ssize_t rc = 0;
+
+	task = get_proc_task(file_inode(file));
+	if (!task)
+		return -ESRCH;
+
+	if (task->flags & PF_MEMALLOC_NOIO)
+		rc = simple_read_from_buffer(buf, count, ppos, "noio", 4);
+	put_task_struct(task);
+	return rc;
+}
+
+static ssize_t memalloc_write(struct file *file, const char __user *buf,
+			      size_t count, loff_t *ppos)
+{
+	struct task_struct *task;
+	char buffer[5];
+	int rc = count;
+
+	memset(buffer, 0, sizeof(buffer));
+	if (count != sizeof(buffer) - 1)
+		return -EINVAL;
+
+	if (copy_from_user(buffer, buf, count))
+		return -EFAULT;
+	buffer[count] = '\0';
+
+	task = get_proc_task(file_inode(file));
+	if (!task)
+		return -ESRCH;
+
+	if (!strcmp(buffer, "noio")) {
+		task->flags |= PF_MEMALLOC_NOIO;
+	} else {
+		rc = -EINVAL;
+	}
+
+	put_task_struct(task);
+	return rc;
+}
+
+static const struct file_operations proc_memalloc_operations = {
+	.read		= memalloc_read,
+	.write		= memalloc_write,
+	.llseek		= default_llseek,
+};
+
 #ifdef CONFIG_AUDIT
 #define TMPBUFLEN 11
 static ssize_t proc_loginuid_read(struct file * file, char __user * buf,
@@ -3097,6 +3148,7 @@ static const struct pid_entry tgid_base_stuff[] = {
 #ifdef CONFIG_PROC_PID_ARCH_STATUS
 	ONE("arch_status", S_IRUGO, proc_pid_arch_status),
 #endif
+	REG("memalloc", S_IRUGO|S_IWUSR, proc_memalloc_operations),
 };
 
 static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx)
@@ -3487,6 +3539,7 @@ static const struct pid_entry tid_base_stuff[] = {
 #ifdef CONFIG_PROC_PID_ARCH_STATUS
 	ONE("arch_status", S_IRUGO, proc_pid_arch_status),
 #endif
+	REG("memalloc", S_IRUGO|S_IWUSR, proc_memalloc_operations),
 };
 
 static int proc_tid_base_readdir(struct file *file, struct dir_context *ctx)
-- 
2.21.0


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, back to index

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20190911031348.9648-1-hdanton@sina.com>
2019-09-11 10:07 ` [RFC PATCH] Add proc interface to set PF_MEMALLOC flags Tetsuo Handa
2019-09-11 15:44   ` Mike Christie
     [not found] ` <20190911135237.11248-1-hdanton@sina.com>
2019-09-11 14:20   ` Tetsuo Handa
2019-09-09 16:28 Mike Christie
2019-09-09 18:26 ` Mike Christie
2019-09-10  8:35   ` Damien Le Moal
2019-09-11  8:43     ` Martin Raiber
     [not found]     ` <0102016d1f7af966-334f093b-2a62-4baa-9678-8d90d5fba6d9-000000@eu-west-1.amazonses.com>
2019-09-11 16:56       ` Mike Christie
2019-09-11 19:21         ` Martin Raiber
2019-09-12 16:22           ` Mike Christie
2019-09-12 16:27             ` Mike Christie
2019-09-10 22:12   ` Tetsuo Handa
2019-09-10 23:28     ` Kirill A. Shutemov
2019-09-11 15:23     ` Mike Christie
2019-09-10 10:00 ` Kirill A. Shutemov
2019-09-10 12:05   ` Damien Le Moal
2019-09-10 12:41     ` Kirill A. Shutemov
2019-09-10 13:37       ` Damien Le Moal
2019-09-10 16:06   ` Mike Christie
2019-09-11  8:23 ` Bart Van Assche

Linux-Block Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-block/0 linux-block/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-block linux-block/ https://lore.kernel.org/linux-block \
		linux-block@vger.kernel.org linux-block@archiver.kernel.org
	public-inbox-index linux-block


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-block


AGPL code for this site: git clone https://public-inbox.org/ public-inbox