* Postpone copy-up until first real write?
@ 2018-06-29  3:35 cgxu519
  2018-06-29 10:28 ` Amir Goldstein
  0 siblings, 1 reply; 9+ messages in thread
From: cgxu519 @ 2018-06-29  3:35 UTC (permalink / raw)
  To: linux-unionfs

Hi guys,

I have a simple question about copy-up timing now that stacked file
operations have been implemented.
Could we postpone copy-up until the first real write happens?

The background is that spending so much time on open() is unexpected for
some of our applications.

Thanks,
Chengguang




* Re: Postpone copy-up until first real write?
  2018-06-29  3:35 Postpone copy-up until first real write? cgxu519
@ 2018-06-29 10:28 ` Amir Goldstein
  2018-06-29 11:57   ` cgxu519
  2018-10-08 14:41   ` Miklos Szeredi
  0 siblings, 2 replies; 9+ messages in thread
From: Amir Goldstein @ 2018-06-29 10:28 UTC (permalink / raw)
  To: cgxu519; +Cc: overlayfs, Miklos Szeredi

On Fri, Jun 29, 2018 at 6:35 AM, cgxu519 <cgxu519@gmx.com> wrote:
> Hi guys,
>
> I have a simple question about copy-up timing after implementing stacked
> file operation.
> Could we postpone copy-up until first real write happens?
>
> The background is consuming much time on open is unexpected for some of our
> applications.
>

I need this behavior as well for snapshots.
At first glance it looks feasible to me.
It seems like we could relax copy-up on open(O_RDWR) if there are no special
open flags (O_TRUNC, O_CREAT, O_APPEND...), then mask out O_ACCMODE
when opening a lower inode and defer the copy-up to ovl_real_fdget_meta() in
case the O_ACCMODE flags of file and real->file don't match.

Then we can introduce an ovl_real_fdget_read() call to be used in
ovl_read_iter() and ovl_copyfile() (in_file), in which cases the mismatched
O_ACCMODE flags don't need to be corrected with a copy-up.
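
Roughly, the open-time check could look something like this (an
illustrative sketch only, not actual overlayfs code; the helper name is
invented):

#include <fcntl.h>
#include <stdbool.h>

/* Hypothetical helper: decide whether an open on a lower inode may skip
 * the immediate copy-up.  Only a plain O_RDWR open qualifies; read-only
 * opens never copy up in the first place, and flags that imply an
 * immediate change to the file (O_TRUNC, O_CREAT, O_APPEND) still force
 * copy-up at open time. */
static bool ovl_open_may_defer_copy_up(int flags)
{
        if ((flags & O_ACCMODE) != O_RDWR)
                return false;
        if (flags & (O_TRUNC | O_CREAT | O_APPEND))
                return false;
        return true;
}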

Miklos? Does this sound reasonable to you?

Chengguang,

Will you take a swing at implementing this if there are no holes found
in the plan?
Do you know if your applications typically open files with O_RDWR? Any
other flags?
Deferring copy-up to the first write with O_WRONLY seems a bit more tricky.

Thanks,
Amir.


* Re: Postpone copy-up until first real write?
  2018-06-29 10:28 ` Amir Goldstein
@ 2018-06-29 11:57   ` cgxu519
  2018-10-08 14:41   ` Miklos Szeredi
  1 sibling, 0 replies; 9+ messages in thread
From: cgxu519 @ 2018-06-29 11:57 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: overlayfs, Miklos Szeredi

On 06/29/2018 06:28 PM, Amir Goldstein wrote:
> On Fri, Jun 29, 2018 at 6:35 AM, cgxu519 <cgxu519@gmx.com> wrote:
>> Hi guys,
>>
>> I have a simple question about copy-up timing after implementing stacked
>> file operation.
>> Could we postpone copy-up until first real write happens?
>>
>> The background is consuming much time on open is unexpected for some of our
>> applications.
>>
> I need this behavior as well for snapshots.
> On first glance it looks feasible to me.
> Seems like we could relax copy up on open(O_RDWR) if there were no special
> open flags (O_TRUNC, O_CREATE, O_APPEND...), then mask out O_ACCMODE
> if opening a lower inode and defer copy up to ovl_real_fdget_meta() in case the
> O_ACCMODE flags of file and real->file don't match.
>
> Then we can introduce ovl_real_fdget_read() call to be used in ovl_read_iter()
> and ovl_copyfile() (in_file) in which cases the mismatch O_ACCMODE flags
> doesn't need to be corrected with copy up.
>
> Miklos? Does this sound reasonable to you?
>
> Chengguang,
>
> Will you take a swing at implementing this if there are no holes found
> in the plan?
Sure, I'll do it if the behavior is welcome.

> Do you know if your applications typically open files with O_RDWR? any
> other flags?
> Deferring copy up to first write with O_WRONLY seems a bit more tricky.

Most of them open with O_RDWR, but I can't be sure that nobody will
complain about an O_WRONLY case in the future. :-/

Thanks,
Chengguang


* Re: Postpone copy-up until first real write?
  2018-06-29 10:28 ` Amir Goldstein
  2018-06-29 11:57   ` cgxu519
@ 2018-10-08 14:41   ` Miklos Szeredi
  2018-10-08 20:23     ` Amir Goldstein
  2019-01-13 11:25     ` Amir Goldstein
  1 sibling, 2 replies; 9+ messages in thread
From: Miklos Szeredi @ 2018-10-08 14:41 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: cgxu519, overlayfs

On Fri, Jun 29, 2018 at 12:28 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> On Fri, Jun 29, 2018 at 6:35 AM, cgxu519 <cgxu519@gmx.com> wrote:
>> Hi guys,
>>
>> I have a simple question about copy-up timing after implementing stacked
>> file operation.
>> Could we postpone copy-up until first real write happens?
>>
>> The background is consuming much time on open is unexpected for some of our
>> applications.
>>
>
> I need this behavior as well for snapshots.
> On first glance it looks feasible to me.
> Seems like we could relax copy up on open(O_RDWR) if there were no special
> open flags (O_TRUNC, O_CREATE, O_APPEND...), then mask out O_ACCMODE
> if opening a lower inode and defer copy up to ovl_real_fdget_meta() in case the
> O_ACCMODE flags of file and real->file don't match.
>
> Then we can introduce ovl_real_fdget_read() call to be used in ovl_read_iter()
> and ovl_copyfile() (in_file) in which cases the mismatch O_ACCMODE flags
> doesn't need to be corrected with copy up.
>
> Miklos? Does this sound reasonable to you?

So now that the stacked file ops are merged, we can take a look at
these somewhat related issues:

 - shared r/o mmap of lower file
 - delayed copy up until first write
 - delayed copy up until first write fault on shared r/w mmap.

My idea would be to have three (or four) states for each in-core overlay inode:

  LOWER:        opened only for read, or not opened at all; mmapped with
                MAP_PRIVATE or not mapped at all
  LOWER_SHARED: opened for read/write, or mmapped MAP_SHARED
  UPPER_SHARED: copied up
  (UPPER:       copied up, and all open files referring to the inode were
                opened in the UPPER state)

The LOWER_SHARED state is basically in anticipation of copy-up,
without actually committing to the upper layer.

The UPPER state could just be an optimization that would otherwise be
equivalent to the UPPER_SHARED state.

I/O paths for the different states would be:

LOWER/UPPER: just like now: read/write are stacked, and mmap goes directly
to the "realfile".
*_SHARED: just like a normal filesystem: uses
generic_file_{read|write}_iter and generic_mmap, and defines its own a_ops
and vm_ops.

So basically the _SHARED mode loses the cache sharing property of
LOWER, but gains the consistency offered by a unified page cache
regardless of which layer the file is stored in.
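
To make that concrete, the states and transitions could be sketched
roughly as below (everything in the sketch is invented for illustration;
it is not overlayfs code):

/* Proposed per-inode states and the events that move an inode between
 * them (illustrative only). */
enum ovl_inode_state {
        OVL_LOWER,              /* read-only opens / MAP_PRIVATE only */
        OVL_LOWER_SHARED,       /* O_RDWR open or MAP_SHARED mmap;
                                   copy-up anticipated, not yet done */
        OVL_UPPER_SHARED,       /* copied up, unified page cache */
        OVL_UPPER,              /* copied up, all opens saw the upper file */
};

enum ovl_inode_event {
        OVL_EV_OPEN_RDWR,       /* open(O_RDWR) without O_TRUNC etc. */
        OVL_EV_MMAP_SHARED,     /* mmap(MAP_SHARED) */
        OVL_EV_FIRST_WRITE,     /* first write or write fault */
};

static enum ovl_inode_state
ovl_state_next(enum ovl_inode_state s, enum ovl_inode_event ev)
{
        switch (ev) {
        case OVL_EV_OPEN_RDWR:
        case OVL_EV_MMAP_SHARED:
                /* Anticipate copy-up without committing to the upper layer. */
                return s == OVL_LOWER ? OVL_LOWER_SHARED : s;
        case OVL_EV_FIRST_WRITE:
                /* This is where the actual copy-up would happen. */
                if (s == OVL_LOWER || s == OVL_LOWER_SHARED)
                        return OVL_UPPER_SHARED;
                return s;
        }
        return s;
}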

Thoughts?

Does this make sense wrt. overlay-snapshots?

One detail is whether we should ever make the LOWER_SHARED -> LOWER
transition while the inode is cached.  My gut feeling is that we
probably needn't bother.

Also I'm not sure we'd actually need the UPPER state at all, since (if
implemented right) UPPER_SHARED should offer a very similar cache
footprint and performance.

Thanks,
Miklos


* Re: Postpone copy-up until first real write?
  2018-10-08 14:41   ` Miklos Szeredi
@ 2018-10-08 20:23     ` Amir Goldstein
  2019-01-13 11:25     ` Amir Goldstein
  1 sibling, 0 replies; 9+ messages in thread
From: Amir Goldstein @ 2018-10-08 20:23 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Chengguang Xu, overlayfs

On Mon, Oct 8, 2018 at 5:41 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Fri, Jun 29, 2018 at 12:28 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> > On Fri, Jun 29, 2018 at 6:35 AM, cgxu519 <cgxu519@gmx.com> wrote:
> >> Hi guys,
> >>
> >> I have a simple question about copy-up timing after implementing stacked
> >> file operation.
> >> Could we postpone copy-up until first real write happens?
> >>
> >> The background is consuming much time on open is unexpected for some of our
> >> applications.
> >>
> >
> > I need this behavior as well for snapshots.
> > On first glance it looks feasible to me.
> > Seems like we could relax copy up on open(O_RDWR) if there were no special
> > open flags (O_TRUNC, O_CREATE, O_APPEND...), then mask out O_ACCMODE
> > if opening a lower inode and defer copy up to ovl_real_fdget_meta() in case the
> > O_ACCMODE flags of file and real->file don't match.
> >
> > Then we can introduce ovl_real_fdget_read() call to be used in ovl_read_iter()
> > and ovl_copyfile() (in_file) in which cases the mismatch O_ACCMODE flags
> > doesn't need to be corrected with copy up.
> >
> > Miklos? Does this sound reasonable to you?
>
> So now that the stacked file ops are merged we can take a look at
> these somewhat related issues:
>
>  - shared r/o mmap of lower file
>  - delayed copy up until first write
>  - delayed copy up until first write fault on shared r/w mmap.
>
> My idea would be to have three (or four) states for each in core overlay inode:
>
>   LOWER:  opened only for read or not opened at all and mmaped with
> MAP_PRIVATE  or not mapped at all
>   LOWER_SHARED: opened for read/write or mmaped MAP_SHARED
>   UPPER_SHARED: copied up
>   (UPPER:  copied up and all open files referring to the inode were
> opened in the UPPER state)
>
> The LOWER_SHARED state is basically in anticipation of copy-up,
> without actually committing to the upper layer.
>
> The UPPER state could just be an optimization, that would be otherwise
> equivalent to the UPPER_SHARED state.
>
> I/O paths for the different states would be:
>
> LOWER/UPPER: just like now: read/write are stacked, mmap goes directly
> to "realfile"
> *_SHARED: just like a normal filesystem: uses
> generic_file_{read|write}_iter, generic_mmap, defines own a_ops,
> vm_ops.
>
> So basically the _SHARED mode loses the cache sharing property of
> LOWER, but gains the consistency offered by a unified page cache
> regardless of which layer the file is stored in.
>
> Thoughts?

Sounds good, and hopefully not too complicated?
I am not familiar enough with mm to understand what we are up
against.

Does I/O from the upper fs backing inode go directly to the overlay inode
page cache, or does it go first to the upper fs inode page cache and then
get moved to the overlay inode page cache?
Am I overcomplicating this?
I suppose sharing pages with LOWER is left for another time?

>
> Does this make sense wrt. overlay-snapshots?
>

Snapshots need the a_ops/vm_ops to COW on page faults
and need delayed copy-up on first write (after a snapshot is taken),
so all this will be very useful.

> One detail is whether we should ever make the LOWER_SHARED -> LOWER
> transition while the inode is cached.  My gut feeling is that we
> probably needn't bother.
>
> Also I'm not sure we'd actually need the UPPER state at all, since (if
> implemented right) UPPER_SHARED should offer a very similar cache
> footprint and performance.
>

If we don't have the UPPER state then we won't need to call
sync_filesystem(upper_sb) in ovl_sync_fs(), only
sb->s_op->sync_fs(), right? So that is a win for simplicity.

The same goes for implementing ovl_freeze(); it should mostly "just work".
Snapshots need freeze, but snapshots can turn off the UPPER state
optimization if it is implemented at all.
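
As a simplified illustration of that distinction (a sketch only, not the
actual fs/overlayfs/super.c code):

#include <linux/fs.h>

/* Sketch: with an UPPER state, dirty pages may live in upper fs inodes,
 * so the whole upper filesystem must be synced; without it, overlay
 * inodes own the dirty pages and only the upper fs's own metadata needs
 * flushing. */
static int ovl_sync_fs_sketch(struct super_block *upper_sb, int wait,
                              bool have_upper_state)
{
        if (have_upper_state)
                return sync_filesystem(upper_sb);

        if (upper_sb->s_op->sync_fs)
                return upper_sb->s_op->sync_fs(upper_sb, wait);
        return 0;
}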

Thanks,
Amir.


* Re: Postpone copy-up until first real write?
  2018-10-08 14:41   ` Miklos Szeredi
  2018-10-08 20:23     ` Amir Goldstein
@ 2019-01-13 11:25     ` Amir Goldstein
  2019-01-14  8:21       ` cgxu519
  2019-01-15  9:19       ` Miklos Szeredi
  1 sibling, 2 replies; 9+ messages in thread
From: Amir Goldstein @ 2019-01-13 11:25 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: cgxu519, overlayfs

On Mon, Oct 8, 2018 at 5:41 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Fri, Jun 29, 2018 at 12:28 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> > On Fri, Jun 29, 2018 at 6:35 AM, cgxu519 <cgxu519@gmx.com> wrote:
> >> Hi guys,
> >>
> >> I have a simple question about copy-up timing after implementing stacked
> >> file operation.
> >> Could we postpone copy-up until first real write happens?
> >>
> >> The background is consuming much time on open is unexpected for some of our
> >> applications.
> >>
> >

Getting back to this with some more questions...

> > I need this behavior as well for snapshots.
> > On first glance it looks feasible to me.
> > Seems like we could relax copy up on open(O_RDWR) if there were no special
> > open flags (O_TRUNC, O_CREATE, O_APPEND...), then mask out O_ACCMODE
> > if opening a lower inode and defer copy up to ovl_real_fdget_meta() in case the
> > O_ACCMODE flags of file and real->file don't match.
> >
> > Then we can introduce ovl_real_fdget_read() call to be used in ovl_read_iter()
> > and ovl_copyfile() (in_file) in which cases the mismatch O_ACCMODE flags
> > doesn't need to be corrected with copy up.
> >
> > Miklos? Does this sound reasonable to you?
>
> So now that the stacked file ops are merged we can take a look at
> these somewhat related issues:
>
>  - shared r/o mmap of lower file
>  - delayed copy up until first write
>  - delayed copy up until first write fault on shared r/w mmap.
>

These three issues map closely together, but do not cover all "delayed copy up"
cases. Even if we do implement copy-up on first write fault, we'd still need
to perform delayed copy-up for fallocate(), clone_range() and dedupe_range(),
so perhaps my suggestion above of a simpler(?) first-step implementation of
"delayed copy up until first write", without bundling it together with the
shared maps issues, still makes some sense?
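
In other words, every data-modifying entry point would have to funnel
through the same deferred copy-up check; a rough illustration (the struct
and helper below are invented):

/* write_iter, fallocate and remap_file_range (clone/dedupe) would all
 * call this before touching data, while read_iter would not. */
struct lazy_copy_up_state {
        int copied_up;          /* 0 while still backed by the lower layer */
};

static int ensure_copied_up(struct lazy_copy_up_state *s)
{
        if (!s->copied_up) {
                /* ... perform the actual data copy-up here ... */
                s->copied_up = 1;
        }
        return 0;
}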

> My idea would be to have three (or four) states for each in core overlay inode:
>
>   LOWER:  opened only for read or not opened at all and mmaped with
> MAP_PRIVATE  or not mapped at all
>   LOWER_SHARED: opened for read/write or mmaped MAP_SHARED
>   UPPER_SHARED: copied up
>   (UPPER:  copied up and all open files referring to the inode were
> opened in the UPPER state)
>
> The LOWER_SHARED state is basically in anticipation of copy-up,
> without actually committing to the upper layer.
>
> The UPPER state could just be an optimization, that would be otherwise
> equivalent to the UPPER_SHARED state.
>
> I/O paths for the different states would be:
>
> LOWER/UPPER: just like now: read/write are stacked, mmap goes directly
> to "realfile"
> *_SHARED: just like a normal filesystem: uses
> generic_file_{read|write}_iter, generic_mmap, defines own a_ops,
> vm_ops.
>
> So basically the _SHARED mode loses the cache sharing property of
> LOWER, but gains the consistency offered by a unified page cache
> regardless of which layer the file is stored in.
>
> Thoughts?
>

Did you give any consideration to the locking order of copy-up
from write fault context?

I suppose you are aware of this, hence the proposal of a pre_mmap op:
https://marc.info/?l=linux-fsdevel&m=152760695502640&w=2

Do we gain anything (simplicity? a relaxed locking order) if we always
copy up metadata on open(O_RDWR) and only delay the data copy-up
until the first write/page fault?
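
The split would look roughly like this (purely illustrative; the phases
and function names are invented):

enum copy_up_phase { COPY_UP_NONE, COPY_UP_META, COPY_UP_DATA };

/* Phase 1, at open(O_RDWR): create the upper inode and parent dirs and
 * copy metadata only -- cheap, and it settles the locking at open time. */
static void copy_up_on_open_rdwr(enum copy_up_phase *phase)
{
        if (*phase == COPY_UP_NONE)
                *phase = COPY_UP_META;
}

/* Phase 2, at the first write or write fault: copy the file data into
 * the already-existing upper inode. */
static void copy_up_on_first_write(enum copy_up_phase *phase)
{
        if (*phase == COPY_UP_META)
                *phase = COPY_UP_DATA;
}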

Chengguang,

For your use case, is it acceptable if we copy up metadata
(and parents) before the first write?

Anyway, if no one else is working on this, I am willing to take
a swing at it, but it seems to me that it would be best to leave
stacked a_ops for phase two. Thoughts?

Thanks,
Amir.


* Re: Postpone copy-up until first real write?
  2019-01-13 11:25     ` Amir Goldstein
@ 2019-01-14  8:21       ` cgxu519
  2019-01-18 13:52         ` Amir Goldstein
  2019-01-15  9:19       ` Miklos Szeredi
  1 sibling, 1 reply; 9+ messages in thread
From: cgxu519 @ 2019-01-14  8:21 UTC (permalink / raw)
  To: Amir Goldstein, Miklos Szeredi; +Cc: overlayfs


On 1/13/19 7:25 PM, Amir Goldstein wrote:
> On Mon, Oct 8, 2018 at 5:41 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
>> On Fri, Jun 29, 2018 at 12:28 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>>> On Fri, Jun 29, 2018 at 6:35 AM, cgxu519 <cgxu519@gmx.com> wrote:
>>>> Hi guys,
>>>>
>>>> I have a simple question about copy-up timing after implementing stacked
>>>> file operation.
>>>> Could we postpone copy-up until first real write happens?
>>>>
>>>> The background is consuming much time on open is unexpected for some of our
>>>> applications.
>>>>
> Getting back to this with some more questions...
>
>>> I need this behavior as well for snapshots.
>>> On first glance it looks feasible to me.
>>> Seems like we could relax copy up on open(O_RDWR) if there were no special
>>> open flags (O_TRUNC, O_CREATE, O_APPEND...), then mask out O_ACCMODE
>>> if opening a lower inode and defer copy up to ovl_real_fdget_meta() in case the
>>> O_ACCMODE flags of file and real->file don't match.
>>>
>>> Then we can introduce ovl_real_fdget_read() call to be used in ovl_read_iter()
>>> and ovl_copyfile() (in_file) in which cases the mismatch O_ACCMODE flags
>>> doesn't need to be corrected with copy up.
>>>
>>> Miklos? Does this sound reasonable to you?
>> So now that the stacked file ops are merged we can take a look at
>> these somewhat related issues:
>>
>>   - shared r/o mmap of lower file
>>   - delayed copy up until first write
>>   - delayed copy up until first write fault on shared r/w mmap.
>>
> These 3 issues map closely together, but do not cover all "delayed copy up"
> cases. Even if we do implement copy up on first write fault, we'd still need
> to perform delayed copy up for fallocate(),clone_range(),dedupe_range(),
> so perhaps my suggestion above for simpler(?) first step implementation of
> "delayed copy up until first write" without bundling it together with the
> shared maps issues still makes some sense?
>
>> My idea would be to have three (or four) states for each in core overlay inode:
>>
>>    LOWER:  opened only for read or not opened at all and mmaped with
>> MAP_PRIVATE  or not mapped at all
>>    LOWER_SHARED: opened for read/write or mmaped MAP_SHARED
>>    UPPER_SHARED: copied up
>>    (UPPER:  copied up and all open files referring to the inode were
>> opened in the UPPER state)
>>
>> The LOWER_SHARED state is basically in anticipation of copy-up,
>> without actually committing to the upper layer.
>>
>> The UPPER state could just be an optimization, that would be otherwise
>> equivalent to the UPPER_SHARED state.
>>
>> I/O paths for the different states would be:
>>
>> LOWER/UPPER: just like now: read/write are stacked, mmap goes directly
>> to "realfile"
>> *_SHARED: just like a normal filesystem: uses
>> generic_file_{read|write}_iter, generic_mmap, defines own a_ops,
>> vm_ops.
>>
>> So basically the _SHARED mode loses the cache sharing property of
>> LOWER, but gains the consistency offered by a unified page cache
>> regardless of which layer the file is stored in.
>>
>> Thoughts?
>>
> Did you give any consideration to the locking order of copy up
> from write fault context?
>
> I suppose you are aware of this hence the proposal of pre_mmap op:
> https://marc.info/?l=linux-fsdevel&m=152760695502640&w=2
>
> Do we gain anything (simplicity? relaxed locking order) if we always
> copy up metadata on open O_RDWR and only delay data copy up
> until first write/page fault?
>
> Chengguang,
>
> For your use case, is it acceptable if we copy up metadata
> (and parents) before first write?

In our use case that will be fine, as long as copying up metadata (and
parents) is quick enough.



>
> Anyway, if no one else is working on this I am willing to take
> a swing at it, but it seems to me like it would be best to leave
> stacked a_ops for phase two. Thoughts?

I'm sorry that I can't focus on this right now because I'm tied up with
some other urgent projects, so please take it up if you can.

Thanks,
Chengguang.


* Re: Postpone copy-up until first real write?
  2019-01-13 11:25     ` Amir Goldstein
  2019-01-14  8:21       ` cgxu519
@ 2019-01-15  9:19       ` Miklos Szeredi
  1 sibling, 0 replies; 9+ messages in thread
From: Miklos Szeredi @ 2019-01-15  9:19 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: cgxu519, overlayfs

On Sun, Jan 13, 2019 at 12:25 PM Amir Goldstein <amir73il@gmail.com> wrote:

> Did you give any consideration to the locking order of copy up
> from write fault context?
>
> I suppose you are aware of this hence the proposal of pre_mmap op:
> https://marc.info/?l=linux-fsdevel&m=152760695502640&w=2

Apparently ->fault() is allowed to release mmap_sem and return
VM_FAULT_RETRY, which is perfect for doing the copy up.
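
A rough sketch of how a ->fault() handler could use that (not real
overlayfs code; ovl_needs_copy_up() and ovl_copy_up_on_fault() are
invented names):

#include <linux/fs.h>
#include <linux/file.h>
#include <linux/mm.h>

/* hypothetical helpers, declared only so the sketch reads: */
static bool ovl_needs_copy_up(struct file *file);
static void ovl_copy_up_on_fault(struct file *file);

static vm_fault_t ovl_fault_sketch(struct vm_fault *vmf)
{
        struct file *file = vmf->vma->vm_file;

        if (ovl_needs_copy_up(file) &&
            (vmf->flags & FAULT_FLAG_ALLOW_RETRY)) {
                /* Pin the file, drop mmap_sem, do the copy-up, and ask
                 * the fault path to retry against the upper file. */
                get_file(file);
                up_read(&vmf->vma->vm_mm->mmap_sem);
                ovl_copy_up_on_fault(file);
                fput(file);
                return VM_FAULT_RETRY;
        }

        /* Already copied up (or retry not allowed): generic path. */
        return filemap_fault(vmf);
}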

> Do we gain anything (simplicity? relaxed locking order) if we always
> copy up metadata on open O_RDWR and only delay data copy up
> until first write/page fault?

Yeah, maybe doing that as a start makes sense.

Thanks,
Miklos


* Re: Postpone copy-up until first real write?
  2019-01-14  8:21       ` cgxu519
@ 2019-01-18 13:52         ` Amir Goldstein
  0 siblings, 0 replies; 9+ messages in thread
From: Amir Goldstein @ 2019-01-18 13:52 UTC (permalink / raw)
  To: cgxu519; +Cc: Miklos Szeredi, overlayfs

On Mon, Jan 14, 2019 at 10:21 AM cgxu519 <cgxu519@gmx.com> wrote:
[...]
> > Do we gain anything (simplicity? relaxed locking order) if we always
> > copy up metadata on open O_RDWR and only delay data copy up
> > until first write/page fault?
> >
> > Chengguang,
> >
> > For your use case, is it acceptable if we copy up metadata
> > (and parents) before first write?
>
> In our use case that will be fine if copy up metadata(and parents) is
> quick enough.
>

Chengguang,

I posted an RFC patch (sorry, I forgot to CC you).
It seems to be working as expected, except for the mmap case,
but I have only tested it manually - I have not written any tests yet.

Please test the RFC patch and let me know if it meets the requirements
of your use case.

Thanks,
Amir.

