On 5/18/21 8:56 PM, Jane Chu wrote:
> On 5/18/2021 10:27 AM, Joao Martins wrote:
>
>> On 5/5/21 11:36 PM, Joao Martins wrote:
>>> On 5/5/21 11:20 PM, Dan Williams wrote:
>>>> On Wed, May 5, 2021 at 12:50 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>>> On 5/5/21 7:44 PM, Dan Williams wrote:
>>>>>> On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>>>>>> index b46f63dcaed3..bb28d82dda5e 100644
>>>>>>> --- a/include/linux/memremap.h
>>>>>>> +++ b/include/linux/memremap.h
>>>>>>> @@ -114,6 +114,7 @@ struct dev_pagemap {
>>>>>>> struct completion done;
>>>>>>> enum memory_type type;
>>>>>>> unsigned int flags;
>>>>>>> + unsigned long align;
>>>>>> I think this wants some kernel-doc above to indicate that non-zero
>>>>>> means "use compound pages with tail-page dedup" and zero / PAGE_SIZE
>>>>>> means "use non-compound base pages".
>> [...]
>>
>>>>>> The non-zero value must be
>>>>>> PAGE_SIZE, PMD_PAGE_SIZE or PUD_PAGE_SIZE.
>>>>>> Hmm, maybe it should be an
>>>>>> enum:
>>>>>>
>>>>>> enum devmap_geometry {
>>>>>> DEVMAP_PTE,
>>>>>> DEVMAP_PMD,
>>>>>> DEVMAP_PUD,
>>>>>> }
>>>>>>
>>>>> I suppose a converter between devmap_geometry and page_size would be needed too? And maybe
>>>>> the whole dax/nvdimm align values change meanwhile (as a followup improvement)?
>>>> I think it is ok for dax/nvdimm to continue to maintain their align
>>>> value because it should be ok to have 4MB align if the device really
>>>> wanted. However, when it goes to map that alignment with
>>>> memremap_pages() it can pick a mode. For example, it's already the
>>>> case that dax->align == 1GB is mapped with DEVMAP_PTE today, so
>>>> they're already separate concepts that can stay separate.
>>>>
>>> Gotcha.
>> I am reconsidering part of the above. In general, yes, the meaning of devmap @align
>> represents a slightly different variation of the device @align, i.e. how the metadata is
>> laid out **but** regardless of what kind of page table entries we use for the vmemmap.
>>
>> By using DEVMAP_PTE/PMD/PUD we might end up 1) duplicating what nvdimm/dax already
>> validates in terms of allowed device @align values (i.e. PAGE_SIZE, PMD_SIZE and PUD_SIZE),
>> 2) tying the geometry of the metadata very much to the value we pick for @align at namespace
>> provisioning -- not the "align" we might use at mmap(), perhaps that's what you referred
>> to above? -- and 3) the value of geometry actually derives from the dax device @align, because
>> we will need to create compound pages representing a page size of the @align value.
>>
>> Using your example above: you're saying that dax->align == 1G is mapped with DEVMAP_PTEs,
>> but in reality the vmemmap is populated with PMD/PUD page tables (depending on what archs
>> decide to do at vmemmap_populate()) and uses base pages as its metadata regardless of the
>> device @align. What we really want to convey in @geometry is not page table sizes, but
>> just the page size used for the vmemmap of the dax device. Additionally, limiting its
>> value might not be desirable... if tomorrow Linux on some arch supports dax/nvdimm
>> devices with 4M align or 64K align, the value of @geometry will have to reflect the 4M to
>> create compound pages of order 10 for the said vmemmap.
>>
>> I am going to wait until you finish reviewing the remaining four patches of this series,
>> but maybe this is a simple misnomer (s/align/geometry/) with a comment but without the
>> DEVMAP_{PTE,PMD,PUD} enum part? Or perhaps its own struct with a value, and a
>> setter/getter to audit its value? Thoughts?
>
> Good points there.
>
> My understanding is that dax->align conveys the granularity of size while
> carving out a namespace; it's a geometry attribute loosely akin to the sector size of a spinning
> disk. I tend to think that the device pagesize has almost no relation to "align" in that it's
> possible to have 1G "align" and 4K pagesize, or vice versa. That is, with the advent of compound page
> support, it is possible to totally separate the two concepts.
>
> How about adding a new option to "ndctl create-namespace" that describes the
> device creator's desired pagesize, and another parameter to describe whether the pagesize shall
> be fixed or allowed to be split up, such that, if the intention is to never split up a 2M pagesize, it
> would be possible to save a lot of metadata space on the device?
Maybe that can be selected by the driver too, but it's an interesting point you raise,
should we settle on the geometry (e.g. a geometry sysfs entry, IIUC your
suggestion?). device-dax, for example, would use geometry == align and therefore save space
(like what I propose in patch 10). But fsdax would retain the default, that is, geometry =
PAGE_SIZE and align = PMD_SIZE, should it want to split pages.

Interestingly, devmap poisoning always occurs at the @align level regardless of @geometry.

What I am not sure about is what value (vs added complexity) it brings to allow the geometry
*value* to be selectable by the user, given that so far we seem to only ever initialize metadata
as either sets of base pages [*] or sets of compound pages (of a size). And the difference
between the two can possibly be summarized as split-ability, like you say.

[*] which can optionally be morphed into compound pages by the driver
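For reference, the enum Dan suggests above, paired with the converter Joao asks about,
could look something like the following. This is only a sketch of the idea under
discussion, not code from the posted series; the DEVMAP_* names come from Dan's
suggestion and geometry_to_page_size() is a hypothetical helper:

    /*
     * Sketch only: the proposed devmap geometry enum plus a converter back
     * to the page size used for the device's vmemmap. PMD_SIZE/PUD_SIZE
     * are the usual arch constants (2M/1G on x86-64 with 4K base pages).
     */
    enum devmap_geometry {
            DEVMAP_PTE,     /* vmemmap made of base pages */
            DEVMAP_PMD,     /* compound pages of PMD_SIZE */
            DEVMAP_PUD,     /* compound pages of PUD_SIZE */
    };

    static inline unsigned long geometry_to_page_size(enum devmap_geometry g)
    {
            switch (g) {
            case DEVMAP_PMD:
                    return PMD_SIZE;
            case DEVMAP_PUD:
                    return PUD_SIZE;
            case DEVMAP_PTE:
            default:
                    return PAGE_SIZE;
            }
    }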
On Tue, May 18, 2021 at 01:08:27PM +0200, Michal Hocko wrote:
> On Tue 18-05-21 12:35:36, David Hildenbrand wrote:
> > On 18.05.21 12:31, Michal Hocko wrote:
> > >
> > > Although I have to say openly that I am not a great fan of VM_FAULT_OOM
> > > in general. It is usually a wrong way to handle the failure
> > > because it happens outside of the allocation context so you lose all the
> > > details (e.g. allocation constraints, numa policy etc.). Also whenever
> > > there is ENOMEM then the allocation itself has already made sure that
> > > all the reclaim attempts have been already depleted. Just consider an
> > > allocation with GFP_NOWAIT/NO_RETRY or similar to fail and propagate
> > > ENOMEM up the call stack. Turning that into the OOM killer sounds like a
> > > bad idea to me. But that is a more general topic. I have tried to bring
> > > this up in the past but there was not much of an interest to fix it as
> > > it was not a pressing problem...
> >
> > I'm certainly interested; it would mean that we actually want to try
> > recovering from VM_FAULT_OOM in various cases, and as you state, we might
> > have to supply more information to make that work reliably.
>
> Or maybe we want to get rid of VM_FAULT_OOM altogether... But this is
> really tangent to this discussion. The only relation is that this would
> be another place to check when somebody wants to go that direction.

If we are to get rid of VM_FAULT_OOM, vmf_error() would be updated and
this place will get the update automagically.

> > Having that said, I guess what we have here is just the same as when our
> > process fails to allocate a generic page table in __handle_mm_fault(), when
> > we fail p4d_alloc() and friends ...
>
> From a quick look it is really similar in a sense that it effectively never
> happens and if it does then it certainly does the wrong thing. The point
> I was trying to make is that there is likely no need to go that way.

As David pointed out, failure to handle the direct map in
secretmem_fault() is like any other allocation failure in page fault
handling, and most of them result in VM_FAULT_OOM, so I think that having
vmf_error() in secretmem_fault() is more consistent with the rest of the
code than using VM_FAULT_SIGBUS.

Besides, if the direct map manipulation failures were to result in errors
other than -ENOMEM, having vmf_error() may prove useful.

--
Sincerely yours,
Mike.
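For context on the point about the update propagating automatically: vmf_error() is a
tiny helper in include/linux/mm.h that funnels errno values into fault codes, roughly
as below (paraphrased; details may vary across kernel versions), so retiring
VM_FAULT_OOM would indeed only require changing this one spot:

    /* Paraphrase of the mainline helper; every caller inherits its mapping. */
    static inline vm_fault_t vmf_error(int err)
    {
            if (err == -ENOMEM)
                    return VM_FAULT_OOM;
            return VM_FAULT_SIGBUS;
    }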
On 5/8/21 10:06 AM, Vaibhav Jain wrote:
> Currently drc_pmem_query_stats() generates a dev_err in case the
> "Enable Performance Information Collection" feature is disabled from
> the HMC or performance stats are not available for an nvdimm. The error
> is of the form below:
>
> papr_scm ibm,persistent-memory:ibm,pmemory@44104001: Failed to query
>  performance stats, Err:-10
>
> This error message confuses users as it implies a possible problem
> with the nvdimm even though it's due to a disabled/unavailable
> feature. We fix this by explicitly handling the H_AUTHORITY and
> H_UNSUPPORTED errors from the H_SCM_PERFORMANCE_STATS hcall.
>
> In case of an H_AUTHORITY error, an info message is logged instead of an
> error, saying "Permission denied while accessing performance
> stats", and an EPERM error is returned back.
>
> In case of an H_UNSUPPORTED error, we return an EOPNOTSUPP error back
> from drc_pmem_query_stats() indicating that the performance stats-query
> operation is not supported on this nvdimm.
>
> Fixes: 2d02bf835e57 ("powerpc/papr_scm: Fetch nvdimm performance stats from PHYP")
> Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
> ---
> Changelog
>
> v3:
> * Return EOPNOTSUPP error in case of H_UNSUPPORTED [ Ira ]
> * Return EPERM in case of H_AUTHORITY [ Ira ]
> * Updated patch description

Patch looks good to me.

Reviewed-by: Kajol Jain <kjain@linux.ibm.com>

Thanks,
Kajol Jain

> v2:
> * Updated the message logged in case of H_AUTHORITY error [ Ira ]
> * Switched from dev_warn to dev_info in case of H_AUTHORITY error.
> * Instead of -EPERM return -EACCESS for H_AUTHORITY error.
> * Added explicit handling of H_UNSUPPORTED error.
> ---
>  arch/powerpc/platforms/pseries/papr_scm.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
> index ef26fe40efb0..e2b69cc3beaf 100644
> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> @@ -310,6 +310,13 @@ static ssize_t drc_pmem_query_stats(struct papr_scm_priv *p,
>  		dev_err(&p->pdev->dev,
>  			"Unknown performance stats, Err:0x%016lX\n", ret[0]);
>  		return -ENOENT;
> +	} else if (rc == H_AUTHORITY) {
> +		dev_info(&p->pdev->dev,
> +			 "Permission denied while accessing performance stats");
> +		return -EPERM;
> +	} else if (rc == H_UNSUPPORTED) {
> +		dev_dbg(&p->pdev->dev, "Performance stats unsupported\n");
> +		return -EOPNOTSUPP;
>  	} else if (rc != H_SUCCESS) {
>  		dev_err(&p->pdev->dev,
>  			"Failed to query performance stats, Err:%lld\n", rc);
Introduce xfs_mmaplock_two_inodes_and_break_dax_layout() for dax files
that are going to be deduped. After that, call the compare range function
only when the two files are both DAX or both non-DAX.

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
---
 fs/xfs/xfs_file.c    |  2 +-
 fs/xfs/xfs_inode.c   | 57 ++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_inode.h   |  1 +
 fs/xfs/xfs_reflink.c |  4 ++--
 4 files changed, 61 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 38d8eca05aee..bd5002d38df4 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -823,7 +823,7 @@ xfs_wait_dax_page(
 	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
 }

-static int
+int
 xfs_break_dax_layouts(
 	struct inode		*inode,
 	bool			*retry)
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 0369eb22c1bb..d5e2791969ba 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3711,6 +3711,59 @@ xfs_iolock_two_inodes_and_break_layout(
 	return 0;
 }

+static int
+xfs_mmaplock_two_inodes_and_break_dax_layout(
+	struct xfs_inode	*ip1,
+	struct xfs_inode	*ip2)
+{
+	int			error, attempts = 0;
+	bool			retry;
+	struct page		*page;
+	struct xfs_log_item	*lp;
+
+	if (ip1->i_ino > ip2->i_ino)
+		swap(ip1, ip2);
+
+again:
+	retry = false;
+	/* Lock the first inode */
+	xfs_ilock(ip1, XFS_MMAPLOCK_EXCL);
+	error = xfs_break_dax_layouts(VFS_I(ip1), &retry);
+	if (error || retry) {
+		xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
+		goto again;
+	}
+
+	if (ip1 == ip2)
+		return 0;
+
+	/* Nested lock the second inode */
+	lp = &ip1->i_itemp->ili_item;
+	if (lp && test_bit(XFS_LI_IN_AIL, &lp->li_flags)) {
+		if (!xfs_ilock_nowait(ip2,
+		    xfs_lock_inumorder(XFS_MMAPLOCK_EXCL, 1))) {
+			xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
+			if ((++attempts % 5) == 0)
+				delay(1); /* Don't just spin the CPU */
+			goto again;
+		}
+	} else
+		xfs_ilock(ip2, xfs_lock_inumorder(XFS_MMAPLOCK_EXCL, 1));
+	/*
+	 * We cannot use xfs_break_dax_layouts() directly here because it may
+	 * need to unlock & lock the XFS_MMAPLOCK_EXCL which is not suitable
+	 * for this nested lock case.
+	 */
+	page = dax_layout_busy_page(VFS_I(ip2)->i_mapping);
+	if (page && page_ref_count(page) != 1) {
+		xfs_iunlock(ip2, XFS_MMAPLOCK_EXCL);
+		xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
+		goto again;
+	}
+
+	return 0;
+}
+
 /*
  * Lock two inodes so that userspace cannot initiate I/O via file syscalls or
  * mmap activity.
@@ -3725,6 +3778,10 @@ xfs_ilock2_io_mmap(
 	ret = xfs_iolock_two_inodes_and_break_layout(VFS_I(ip1), VFS_I(ip2));
 	if (ret)
 		return ret;
+
+	if (IS_DAX(VFS_I(ip1)) && IS_DAX(VFS_I(ip2)))
+		return xfs_mmaplock_two_inodes_and_break_dax_layout(ip1, ip2);
+
 	if (ip1 == ip2)
 		xfs_ilock(ip1, XFS_MMAPLOCK_EXCL);
 	else
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index ca826cfba91c..2d0b344fb100 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -457,6 +457,7 @@ enum xfs_prealloc_flags {
 int	xfs_update_prealloc_flags(struct xfs_inode *ip,
 		enum xfs_prealloc_flags flags);
+int	xfs_break_dax_layouts(struct inode *inode, bool *retry);
 int	xfs_break_layouts(struct inode *inode, uint *iolock,
 		enum layout_break_reason reason);
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 9a780948dbd0..ff308304c5cd 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1324,8 +1324,8 @@ xfs_reflink_remap_prep(
 	if (XFS_IS_REALTIME_INODE(src) || XFS_IS_REALTIME_INODE(dest))
 		goto out_unlock;

-	/* Don't share DAX file data for now. */
-	if (IS_DAX(inode_in) || IS_DAX(inode_out))
+	/* Don't share DAX file data with non-DAX file. */
+	if (IS_DAX(inode_in) != IS_DAX(inode_out))
 		goto out_unlock;

 	if (!IS_DAX(inode_in))
--
2.31.1
In fsdax mode, WRITE and ZERO on a shared extent need CoW performed.
After CoW, the newly allocated extents need to be remapped to the file.
So, add an iomap_end for the dax write ops to do the remapping work.

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
---
 fs/xfs/xfs_bmap_util.c |  3 +--
 fs/xfs/xfs_file.c      |  9 +++------
 fs/xfs/xfs_iomap.c     | 38 +++++++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_iomap.h     | 24 ++++++++++++++++++++++++
 fs/xfs/xfs_iops.c      |  7 +++----
 fs/xfs/xfs_reflink.c   |  3 +--
 6 files changed, 69 insertions(+), 15 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index a5e9d7d34023..2a36dc93ff27 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -965,8 +965,7 @@ xfs_free_file_space(
 		return 0;
 	if (offset + len > XFS_ISIZE(ip))
 		len = XFS_ISIZE(ip) - offset;
-	error = iomap_zero_range(VFS_I(ip), offset, len, NULL,
-			&xfs_buffered_write_iomap_ops);
+	error = xfs_iomap_zero_range(ip, offset, len, NULL);
 	if (error)
 		return error;
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 396ef36dcd0a..38d8eca05aee 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -684,11 +684,8 @@ xfs_file_dax_write(
 	pos = iocb->ki_pos;

 	trace_xfs_file_dax_write(iocb, from);
-	ret = dax_iomap_rw(iocb, from, &xfs_direct_write_iomap_ops);
-	if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
-		i_size_write(inode, iocb->ki_pos);
-		error = xfs_setfilesize(ip, pos, ret);
-	}
+	ret = dax_iomap_rw(iocb, from, &xfs_dax_write_iomap_ops);
+
 out:
 	if (iolock)
 		xfs_iunlock(ip, iolock);
@@ -1309,7 +1306,7 @@ __xfs_filemap_fault(
 		ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL,
 				(write_fault && !vmf->cow_page) ?
-				&xfs_direct_write_iomap_ops :
+				&xfs_dax_write_iomap_ops :
 				&xfs_read_iomap_ops);
 		if (ret & VM_FAULT_NEEDDSYNC)
 			ret = dax_finish_sync_fault(vmf, pe_size, pfn);
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index d154f42e2dc6..938723aa137d 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -761,7 +761,8 @@ xfs_direct_write_iomap_begin(
 	/* may drop and re-acquire the ilock */
 	error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
-			&lockmode, flags & IOMAP_DIRECT);
+			&lockmode,
+			(flags & IOMAP_DIRECT) || IS_DAX(inode));
 	if (error)
 		goto out_unlock;
 	if (shared)
@@ -854,6 +855,41 @@ const struct iomap_ops xfs_direct_write_iomap_ops = {
 	.iomap_begin		= xfs_direct_write_iomap_begin,
 };

+static int
+xfs_dax_write_iomap_end(
+	struct inode		*inode,
+	loff_t			pos,
+	loff_t			length,
+	ssize_t			written,
+	unsigned int		flags,
+	struct iomap		*iomap)
+{
+	int			error = 0;
+	struct xfs_inode	*ip = XFS_I(inode);
+	bool			cow = xfs_is_cow_inode(ip);
+
+	if (!written)
+		return 0;
+
+	if (pos + written > i_size_read(inode) && !(flags & IOMAP_FAULT)) {
+		i_size_write(inode, pos + written);
+		error = xfs_setfilesize(ip, pos, written);
+		if (error && cow) {
+			xfs_reflink_cancel_cow_range(ip, pos, written, true);
+			return error;
+		}
+	}
+
+	if (cow)
+		error = xfs_reflink_end_cow(ip, pos, written);
+
+	return error;
+}
+
+const struct iomap_ops xfs_dax_write_iomap_ops = {
+	.iomap_begin		= xfs_direct_write_iomap_begin,
+	.iomap_end		= xfs_dax_write_iomap_end,
+};
+
 static int
 xfs_buffered_write_iomap_begin(
 	struct inode		*inode,
diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
index 7d3703556d0e..fbacf638ab21 100644
--- a/fs/xfs/xfs_iomap.h
+++ b/fs/xfs/xfs_iomap.h
@@ -42,8 +42,32 @@ xfs_aligned_fsb_count(
 extern const struct iomap_ops xfs_buffered_write_iomap_ops;
 extern const struct iomap_ops xfs_direct_write_iomap_ops;
+extern const struct iomap_ops xfs_dax_write_iomap_ops;
 extern const struct iomap_ops xfs_read_iomap_ops;
 extern const struct iomap_ops xfs_seek_iomap_ops;
 extern const struct iomap_ops xfs_xattr_iomap_ops;

+static inline int
+xfs_iomap_zero_range(
+	struct xfs_inode	*ip,
+	loff_t			offset,
+	loff_t			len,
+	bool			*did_zero)
+{
+	return iomap_zero_range(VFS_I(ip), offset, len, did_zero,
+			IS_DAX(VFS_I(ip)) ? &xfs_dax_write_iomap_ops
+					  : &xfs_buffered_write_iomap_ops);
+}
+
+static inline int
+xfs_iomap_truncate_page(
+	struct xfs_inode	*ip,
+	loff_t			pos,
+	bool			*did_zero)
+{
+	return iomap_truncate_page(VFS_I(ip), pos, did_zero,
+			IS_DAX(VFS_I(ip)) ? &xfs_dax_write_iomap_ops
+					  : &xfs_buffered_write_iomap_ops);
+}
+
 #endif /* __XFS_IOMAP_H__*/
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index dfe24b7f26e5..6d936c3e1a6e 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -911,8 +911,8 @@ xfs_setattr_size(
 	 */
 	if (newsize > oldsize) {
 		trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
-		error = iomap_zero_range(inode, oldsize, newsize - oldsize,
-				&did_zeroing, &xfs_buffered_write_iomap_ops);
+		error = xfs_iomap_zero_range(ip, oldsize, newsize - oldsize,
+				&did_zeroing);
 	} else {
 		/*
 		 * iomap won't detect a dirty page over an unwritten block (or a
@@ -924,8 +924,7 @@ xfs_setattr_size(
 				newsize);
 		if (error)
 			return error;
-		error = iomap_truncate_page(inode, newsize, &did_zeroing,
-				&xfs_buffered_write_iomap_ops);
+		error = xfs_iomap_truncate_page(ip, newsize, &did_zeroing);
 	}

 	if (error)
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index d25434f93235..9a780948dbd0 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1266,8 +1266,7 @@ xfs_reflink_zero_posteof(
 		return 0;

 	trace_xfs_zero_eof(ip, isize, pos - isize);
-	return iomap_zero_range(VFS_I(ip), isize, pos - isize, NULL,
-			&xfs_buffered_write_iomap_ops);
+	return xfs_iomap_zero_range(ip, isize, pos - isize, NULL);
 }

 /*
--
2.31.1
With dax we cannot deal with readpage() etc. So, we create a dax
comparison function which is similar to vfs_dedupe_file_range_compare(),
and introduce dax_remap_file_range_prep() for filesystem use.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
---
 fs/dax.c             | 66 ++++++++++++++++++++++++++++++++++++++++++++
 fs/remap_range.c     | 36 ++++++++++++++++++------
 fs/xfs/xfs_reflink.c |  8 ++++--
 include/linux/dax.h  |  8 ++++++
 include/linux/fs.h   | 12 +++++---
 5 files changed, 116 insertions(+), 14 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index baee584cb8ae..93f16210847b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1864,3 +1864,69 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
 	return dax_insert_pfn_mkwrite(vmf, pfn, order);
 }
 EXPORT_SYMBOL_GPL(dax_finish_sync_fault);
+
+static loff_t dax_range_compare_actor(struct inode *ino1, loff_t pos1,
+		struct inode *ino2, loff_t pos2, loff_t len, void *data,
+		struct iomap *smap, struct iomap *dmap)
+{
+	void *saddr, *daddr;
+	bool *same = data;
+	int ret;
+
+	if (smap->type == IOMAP_HOLE && dmap->type == IOMAP_HOLE) {
+		*same = true;
+		return len;
+	}
+
+	if (smap->type == IOMAP_HOLE || dmap->type == IOMAP_HOLE) {
+		*same = false;
+		return 0;
+	}
+
+	ret = dax_iomap_direct_access(smap, pos1, ALIGN(pos1 + len, PAGE_SIZE),
+				      &saddr, NULL);
+	if (ret < 0)
+		return -EIO;
+
+	ret = dax_iomap_direct_access(dmap, pos2, ALIGN(pos2 + len, PAGE_SIZE),
+				      &daddr, NULL);
+	if (ret < 0)
+		return -EIO;
+
+	*same = !memcmp(saddr, daddr, len);
+	return len;
+}
+
+int dax_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
+		struct inode *dest, loff_t destoff, loff_t len, bool *is_same,
+		const struct iomap_ops *ops)
+{
+	int id, ret = 0;
+
+	id = dax_read_lock();
+	while (len) {
+		ret = iomap_apply2(src, srcoff, dest, destoff, len, 0, ops,
+				   is_same, dax_range_compare_actor);
+		if (ret < 0 || !*is_same)
+			goto out;
+
+		len -= ret;
+		srcoff += ret;
+		destoff += ret;
+	}
+	ret = 0;
+out:
+	dax_read_unlock(id);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(dax_dedupe_file_range_compare);
+
+int dax_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+			      struct file *file_out, loff_t pos_out,
+			      loff_t *len, unsigned int remap_flags,
+			      const struct iomap_ops *ops)
+{
+	return __generic_remap_file_range_prep(file_in, pos_in, file_out,
+					       pos_out, len, remap_flags, ops);
+}
+EXPORT_SYMBOL(dax_remap_file_range_prep);
diff --git a/fs/remap_range.c b/fs/remap_range.c
index e4a5fdd7ad7b..4cfc1553f3bf 100644
--- a/fs/remap_range.c
+++ b/fs/remap_range.c
@@ -14,6 +14,7 @@
 #include <linux/compat.h>
 #include <linux/mount.h>
 #include <linux/fs.h>
+#include <linux/dax.h>
 #include "internal.h"

 #include <linux/uaccess.h>
@@ -199,9 +200,9 @@ static void vfs_unlock_two_pages(struct page *page1, struct page *page2)
  * Compare extents of two files to see if they are the same.
  * Caller must have locked both inodes to prevent write races.
  */
-static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
-					 struct inode *dest, loff_t destoff,
-					 loff_t len, bool *is_same)
+int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
+				  struct inode *dest, loff_t destoff,
+				  loff_t len, bool *is_same)
 {
 	loff_t src_poff;
 	loff_t dest_poff;
@@ -280,6 +281,7 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
 out_error:
 	return error;
 }
+EXPORT_SYMBOL(vfs_dedupe_file_range_compare);

 /*
  * Check that the two inodes are eligible for cloning, the ranges make
@@ -289,9 +291,11 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
  * If there's an error, then the usual negative error code is returned.
  * Otherwise returns 0 with *len set to the request length.
  */
-int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
-				  struct file *file_out, loff_t pos_out,
-				  loff_t *len, unsigned int remap_flags)
+int
+__generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+				struct file *file_out, loff_t pos_out,
+				loff_t *len, unsigned int remap_flags,
+				const struct iomap_ops *dax_read_ops)
 {
 	struct inode *inode_in = file_inode(file_in);
 	struct inode *inode_out = file_inode(file_out);
@@ -351,8 +355,15 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 	if (remap_flags & REMAP_FILE_DEDUP) {
 		bool		is_same = false;

-		ret = vfs_dedupe_file_range_compare(inode_in, pos_in,
-				inode_out, pos_out, *len, &is_same);
+		if (!IS_DAX(inode_in))
+			ret = vfs_dedupe_file_range_compare(inode_in, pos_in,
+					inode_out, pos_out, *len, &is_same);
+		else if (dax_read_ops)
+			ret = dax_dedupe_file_range_compare(inode_in, pos_in,
+					inode_out, pos_out, *len, &is_same,
+					dax_read_ops);
+		else
+			return -EINVAL;
 		if (ret)
 			return ret;
 		if (!is_same)
@@ -370,6 +381,15 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,

 	return ret;
 }
+EXPORT_SYMBOL(__generic_remap_file_range_prep);
+
+int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+				  struct file *file_out, loff_t pos_out,
+				  loff_t *len, unsigned int remap_flags)
+{
+	return __generic_remap_file_range_prep(file_in, pos_in, file_out,
+					       pos_out, len, remap_flags, NULL);
+}
 EXPORT_SYMBOL(generic_remap_file_range_prep);

 loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 060695d6d56a..d25434f93235 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1329,8 +1329,12 @@ xfs_reflink_remap_prep(
 	if (IS_DAX(inode_in) || IS_DAX(inode_out))
 		goto out_unlock;

-	ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
-			len, remap_flags);
+	if (!IS_DAX(inode_in))
+		ret = generic_remap_file_range_prep(file_in, pos_in, file_out,
+				pos_out, len, remap_flags);
+	else
+		ret = dax_remap_file_range_prep(file_in, pos_in, file_out,
+				pos_out, len, remap_flags, &xfs_read_iomap_ops);
 	if (ret || *len == 0)
 		goto out_unlock;
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 3275e01ed33d..106d1f033a78 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -239,6 +239,14 @@ int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
 				      pgoff_t index);
 s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap,
 		   struct iomap *srcmap);
+int dax_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
+				  struct inode *dest, loff_t destoff,
+				  loff_t len, bool *is_same,
+				  const struct iomap_ops *ops);
+int dax_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+			      struct file *file_out, loff_t pos_out,
+			      loff_t *len, unsigned int remap_flags,
+			      const struct iomap_ops *ops);
 static inline bool dax_mapping(struct address_space *mapping)
 {
 	return mapping->host && IS_DAX(mapping->host);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c3c88fdb9b2a..deed4371f34f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -71,6 +71,7 @@ struct fsverity_operations;
 struct fs_context;
 struct fs_parameter_spec;
 struct fileattr;
+struct iomap_ops;

 extern void __init inode_init(void);
 extern void __init inode_init_early(void);
@@ -2126,10 +2127,13 @@ extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
 extern ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
 				       struct file *file_out, loff_t pos_out,
 				       size_t len, unsigned int flags);
-extern int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
-					 struct file *file_out, loff_t pos_out,
-					 loff_t *count,
-					 unsigned int remap_flags);
+int __generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+				    struct file *file_out, loff_t pos_out,
+				    loff_t *len, unsigned int remap_flags,
+				    const struct iomap_ops *dax_read_ops);
+int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+				  struct file *file_out, loff_t pos_out,
+				  loff_t *count, unsigned int remap_flags);
 extern loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
 				  struct file *file_out, loff_t pos_out,
 				  loff_t len, unsigned int remap_flags);
--
2.31.1
Some operations, such as comparing a range of data in two files under
fsdax mode, require nested iomap_begin()/iomap_end() calls on two files.
Thus, we introduce iomap_apply2() to accept arguments from two files,
and iomap_actor2_t for actions on two files.

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/iomap/apply.c      | 52 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/iomap.h |  7 +++++-
 2 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c
index 26ab6563181f..0493da5286ad 100644
--- a/fs/iomap/apply.c
+++ b/fs/iomap/apply.c
@@ -97,3 +97,55 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
 	return written ? written : ret;
 }
+
+loff_t
+iomap_apply2(struct inode *ino1, loff_t pos1, struct inode *ino2, loff_t pos2,
+		loff_t length, unsigned int flags, const struct iomap_ops *ops,
+		void *data, iomap_actor2_t actor)
+{
+	struct iomap smap = { .type = IOMAP_HOLE };
+	struct iomap dmap = { .type = IOMAP_HOLE };
+	loff_t written = 0, ret, ret2 = 0;
+	loff_t len1 = length, len2, min_len;
+
+	ret = ops->iomap_begin(ino1, pos1, len1, flags, &smap, NULL);
+	if (ret)
+		goto out;
+	if (WARN_ON(smap.offset > pos1)) {
+		written = -EIO;
+		goto out_src;
+	}
+	if (WARN_ON(smap.length == 0)) {
+		written = -EIO;
+		goto out_src;
+	}
+	len2 = min_t(loff_t, len1, smap.length);
+
+	ret = ops->iomap_begin(ino2, pos2, len2, flags, &dmap, NULL);
+	if (ret)
+		goto out_src;
+	if (WARN_ON(dmap.offset > pos2)) {
+		written = -EIO;
+		goto out_dest;
+	}
+	if (WARN_ON(dmap.length == 0)) {
+		written = -EIO;
+		goto out_dest;
+	}
+	min_len = min_t(loff_t, len2, dmap.length);
+
+	written = actor(ino1, pos1, ino2, pos2, min_len, data, &smap, &dmap);
+
+out_dest:
+	if (ops->iomap_end)
+		ret2 = ops->iomap_end(ino2, pos2, len2,
+				      written > 0 ? written : 0, flags, &dmap);
+out_src:
+	if (ops->iomap_end)
+		ret = ops->iomap_end(ino1, pos1, len1,
+				     written > 0 ? written : 0, flags, &smap);
+out:
+	if (written)
+		return written;
+
+	return ret ?: ret2;
+}
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index c87d0cb0de6d..95562f863ad0 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -150,10 +150,15 @@ struct iomap_ops {
  */
 typedef loff_t (*iomap_actor_t)(struct inode *inode, loff_t pos, loff_t len,
 		void *data, struct iomap *iomap, struct iomap *srcmap);
-
+typedef loff_t (*iomap_actor2_t)(struct inode *ino1, loff_t pos1,
+		struct inode *ino2, loff_t pos2, loff_t len, void *data,
+		struct iomap *smap, struct iomap *dmap);
 loff_t iomap_apply(struct inode *inode, loff_t pos, loff_t length,
 		unsigned flags, const struct iomap_ops *ops, void *data,
 		iomap_actor_t actor);
+loff_t iomap_apply2(struct inode *ino1, loff_t pos1, struct inode *ino2,
+		loff_t pos2, loff_t length, unsigned int flags,
+		const struct iomap_ops *ops, void *data, iomap_actor2_t actor);

 ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
 		const struct iomap_ops *ops);
--
2.31.1
In the case where the iomap is a write operation and the iomap is not
equal to the srcmap after iomap_begin, we consider it a CoW operation.
The destination extent which the iomap indicates is a newly allocated
extent, so the data needs to be copied from the srcmap to the newly
allocated extent.

In theory, it is better to copy only the head and tail ranges that lie
outside the unaligned write range, instead of copying the whole aligned
range. But in a dax page fault, the range will always be aligned, so we
have to copy the whole range in that case.

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/dax.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 81 insertions(+), 5 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index f661227b49cd..6396f091e60b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1049,6 +1049,61 @@ static int dax_iomap_direct_access(struct iomap *iomap, loff_t pos, size_t size,
 	return rc;
 }

+/**
+ * dax_iomap_cow_copy(): Copy the data from source to destination before write.
+ * @pos:	address to do copy from.
+ * @length:	size of copy operation.
+ * @align_size:	aligned w.r.t align_size (either PMD_SIZE or PAGE_SIZE)
+ * @srcmap:	iomap srcmap
+ * @daddr:	destination address to copy to.
+ *
+ * This can be called from two places. Either during DAX write fault, to copy
+ * the length size data to daddr. Or, while doing normal DAX write operation,
+ * dax_iomap_actor() might call this to do the copy of either start or end
+ * unaligned address. In this case the rest of the copy of aligned ranges is
+ * taken care by dax_iomap_actor() itself.
+ * Also, note DAX fault will always result in aligned pos and pos + length.
+ */
+static int dax_iomap_cow_copy(loff_t pos, uint64_t length, size_t align_size,
+		struct iomap *srcmap, void *daddr)
+{
+	loff_t head_off = pos & (align_size - 1);
+	size_t size = ALIGN(head_off + length, align_size);
+	loff_t end = pos + length;
+	loff_t pg_end = round_up(end, align_size);
+	bool copy_all = head_off == 0 && end == pg_end;
+	void *saddr = 0;
+	int ret = 0;
+
+	ret = dax_iomap_direct_access(srcmap, pos, size, &saddr, NULL);
+	if (ret)
+		return ret;
+
+	if (copy_all) {
+		ret = copy_mc_to_kernel(daddr, saddr, length);
+		return ret ? -EIO : 0;
+	}
+
+	/* Copy the head part of the range.  Note: we pass offset as length. */
+	if (head_off) {
+		ret = copy_mc_to_kernel(daddr, saddr, head_off);
+		if (ret)
+			return -EIO;
+	}
+
+	/* Copy the tail part of the range */
+	if (end < pg_end) {
+		loff_t tail_off = head_off + length;
+		loff_t tail_len = pg_end - end;
+
+		ret = copy_mc_to_kernel(daddr + tail_off, saddr + tail_off,
+					tail_len);
+		if (ret)
+			return -EIO;
+	}
+	return 0;
+}
+
 /*
  * The user has performed a load from a hole in the file.  Allocating a new
  * page in the file would cause excessive storage usage for workloads with
@@ -1178,11 +1233,12 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 	struct dax_device *dax_dev = iomap->dax_dev;
 	struct iov_iter *iter = data;
 	loff_t end = pos + length, done = 0;
+	bool write = iov_iter_rw(iter) == WRITE;
 	ssize_t ret = 0;
 	size_t xfer;
 	int id;

-	if (iov_iter_rw(iter) == READ) {
+	if (!write) {
 		end = min(end, i_size_read(inode));
 		if (pos >= end)
 			return 0;
@@ -1191,7 +1247,12 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 			return iov_iter_zero(min(length, end - pos), iter);
 	}

-	if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED))
+	/*
+	 * In DAX mode, we allow either pure overwrites of written extents, or
+	 * writes to unwritten extents as part of a copy-on-write operation.
+	 */
+	if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED &&
+			!(iomap->flags & IOMAP_F_SHARED)))
 		return -EIO;

 	/*
@@ -1230,6 +1291,13 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 			break;
 		}

+		if (write && srcmap->addr != iomap->addr) {
+			ret = dax_iomap_cow_copy(pos, length, PAGE_SIZE, srcmap,
+						 kaddr);
+			if (ret)
+				break;
+		}
+
 		map_len = PFN_PHYS(map_len);
 		kaddr += offset;
 		map_len -= offset;
@@ -1241,7 +1309,7 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		 * validated via access_ok() in either vfs_read() or
 		 * vfs_write(), depending on which operation we are doing.
 		 */
-		if (iov_iter_rw(iter) == WRITE)
+		if (write)
 			xfer = dax_copy_from_iter(dax_dev, pgoff, kaddr,
 					map_len, iter);
 		else
@@ -1393,6 +1461,7 @@ static vm_fault_t dax_fault_actor(struct vm_fault *vmf, pfn_t *pfnp,
 	unsigned long entry_flags = pmd ? DAX_PMD : 0;
 	int err = 0;
 	pfn_t pfn;
+	void *kaddr;

 	/* if we are reading UNWRITTEN and HOLE, return a hole. */
 	if (!write &&
@@ -1403,18 +1472,25 @@ static vm_fault_t dax_fault_actor(struct vm_fault *vmf, pfn_t *pfnp,
 		return dax_pmd_load_hole(xas, vmf, iomap, entry);
 	}

-	if (iomap->type != IOMAP_MAPPED) {
+	if (iomap->type != IOMAP_MAPPED && !(iomap->flags & IOMAP_F_SHARED)) {
 		WARN_ON_ONCE(1);
 		return pmd ? VM_FAULT_FALLBACK : VM_FAULT_SIGBUS;
 	}

-	err = dax_iomap_direct_access(iomap, pos, size, NULL, &pfn);
+	err = dax_iomap_direct_access(iomap, pos, size, &kaddr, &pfn);
 	if (err)
 		return pmd ? VM_FAULT_FALLBACK : dax_fault_return(err);

 	*entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn, entry_flags,
 				  write && !sync);

+	if (write &&
+	    srcmap->addr != IOMAP_HOLE && srcmap->addr != iomap->addr) {
+		err = dax_iomap_cow_copy(pos, size, size, srcmap, kaddr);
+		if (err)
+			return dax_fault_return(err);
+	}
+
 	if (sync)
 		return dax_fault_synchronous_pfnp(pfnp, pfn);
--
2.31.1
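To make the head/tail arithmetic above concrete, here is a worked example of the
offsets dax_iomap_cow_copy() computes for an unaligned write; the numbers are
illustrative only, with align_size = 4096:

    /*
     * Worked example for dax_iomap_cow_copy() with align_size = 0x1000:
     *
     *   pos    = 0x1200  ->  head_off = pos & (align_size - 1)        = 0x200
     *   length = 0x1800  ->  end      = pos + length                  = 0x2a00
     *                        pg_end   = round_up(end, 0x1000)         = 0x3000
     *                        size     = ALIGN(head_off + len, 0x1000) = 0x2000
     *
     * copy_all is false (head_off != 0 and end != pg_end), so:
     *   - head copy: head_off (0x200) bytes from saddr to daddr, i.e. the
     *     bytes of the first page that precede pos;
     *   - tail copy: tail_len = pg_end - end = 0x600 bytes at offset
     *     tail_off = head_off + length = 0x1a00, i.e. the bytes after end
     *     up to the next page boundary.
     * The middle [pos, end) is not copied here; dax_iomap_actor() writes it.
     */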
Punching a hole on a reflinked file needs the CoW copy too; otherwise,
data in the unaligned area will not be correct. So, add the srcmap to
dax_iomap_zero() and replace the memset() with dax_iomap_cow_copy().

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
---
 fs/dax.c               | 25 +++++++++++++++----------
 fs/iomap/buffered-io.c |  2 +-
 include/linux/dax.h    |  3 ++-
 3 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 98531c53d613..baee584cb8ae 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1197,7 +1197,8 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
 }
 #endif /* CONFIG_FS_DAX_PMD */

-s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
+s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap,
+		struct iomap *srcmap)
 {
 	sector_t sector = iomap_sector(iomap, pos & PAGE_MASK);
 	pgoff_t pgoff;
@@ -1219,19 +1220,23 @@ s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)

 	if (page_aligned)
 		rc = dax_zero_page_range(iomap->dax_dev, pgoff, 1);
-	else
+	else {
 		rc = dax_direct_access(iomap->dax_dev, pgoff, 1, &kaddr, NULL);
-	if (rc < 0) {
-		dax_read_unlock(id);
-		return rc;
-	}
-
-	if (!page_aligned) {
-		memset(kaddr + offset, 0, size);
+		if (rc < 0)
+			goto out;
+		if (iomap->addr != srcmap->addr) {
+			rc = dax_iomap_cow_copy(pos, size, PAGE_SIZE, srcmap,
+						kaddr);
+			if (rc < 0)
+				goto out;
+		} else
+			memset(kaddr + offset, 0, size);
 		dax_flush(iomap->dax_dev, kaddr + offset, size);
 	}
+
+out:
 	dax_read_unlock(id);
-	return size;
+	return rc < 0 ? rc : size;
 }

 static loff_t
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 9023717c5188..fdaac4ba9b9d 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -933,7 +933,7 @@ static loff_t iomap_zero_range_actor(struct inode *inode, loff_t pos,
 		s64 bytes;

 		if (IS_DAX(inode))
-			bytes = dax_iomap_zero(pos, length, iomap);
+			bytes = dax_iomap_zero(pos, length, iomap, srcmap);
 		else
 			bytes = iomap_zero(inode, pos, length, iomap, srcmap);
 		if (bytes < 0)
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b52f084aa643..3275e01ed33d 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -237,7 +237,8 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
 int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
 				      pgoff_t index);
-s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap);
+s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap,
+		   struct iomap *srcmap);
 static inline bool dax_mapping(struct address_space *mapping)
 {
 	return mapping->host && IS_DAX(mapping->host);
--
2.31.1
We replace the existing entry with the newly allocated one in case of
CoW. Also, we mark the entry as PAGECACHE_TAG_TOWRITE so writeback marks
this entry as write-protected. This helps us with snapshots: new write
pagefaults after a snapshot trigger a CoW.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/dax.c | 39 ++++++++++++++++++++++++++++-----------
 1 file changed, 28 insertions(+), 11 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 6396f091e60b..98531c53d613 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -733,6 +733,10 @@ static int copy_cow_page_dax(struct block_device *bdev, struct dax_device *dax_d
 	return 0;
 }

+/* DAX Insert Flag: The state of the entry we insert */
+#define DAX_IF_DIRTY		(1 << 0)
+#define DAX_IF_COW		(1 << 1)
+
 /*
  * By this point grab_mapping_entry() has ensured that we have a locked entry
  * of the appropriate size so we don't have to worry about downgrading PMDs to
@@ -740,16 +744,19 @@ static int copy_cow_page_dax(struct block_device *bdev, struct dax_device *dax_d
  * already in the tree, we will skip the insertion and just dirty the PMD as
  * appropriate.
  */
-static void *dax_insert_entry(struct xa_state *xas,
-		struct address_space *mapping, struct vm_fault *vmf,
-		void *entry, pfn_t pfn, unsigned long flags, bool dirty)
+static void *dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
+		void *entry, pfn_t pfn, unsigned long flags,
+		unsigned int insert_flags)
 {
+	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
 	void *new_entry = dax_make_entry(pfn, flags);
+	bool dirty = insert_flags & DAX_IF_DIRTY;
+	bool cow = insert_flags & DAX_IF_COW;

 	if (dirty)
 		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);

-	if (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE)) {
+	if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) {
 		unsigned long index = xas->xa_index;
 		/* we are replacing a zero page with block mapping */
 		if (dax_is_pmd_entry(entry))
@@ -761,7 +768,7 @@ static void *dax_insert_entry(struct xa_state *xas,

 	xas_reset(xas);
 	xas_lock_irq(xas);
-	if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
+	if (cow || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
 		void *old;

 		dax_disassociate_entry(entry, mapping, false);
@@ -785,6 +792,9 @@ static void *dax_insert_entry(struct xa_state *xas,
 	if (dirty)
 		xas_set_mark(xas, PAGECACHE_TAG_DIRTY);

+	if (cow)
+		xas_set_mark(xas, PAGECACHE_TAG_TOWRITE);
+
 	xas_unlock_irq(xas);
 	return entry;
 }
@@ -1120,8 +1130,7 @@ static vm_fault_t dax_load_hole(struct xa_state *xas,
 	pfn_t pfn = pfn_to_pfn_t(my_zero_pfn(vaddr));
 	vm_fault_t ret;

-	*entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
-			DAX_ZERO_PAGE, false);
+	*entry = dax_insert_entry(xas, vmf, *entry, pfn, DAX_ZERO_PAGE, 0);

 	ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
 	trace_dax_load_hole(inode, vmf, ret);
@@ -1148,8 +1157,8 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
 		goto fallback;

 	pfn = page_to_pfn_t(zero_page);
-	*entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
-			DAX_PMD | DAX_ZERO_PAGE, false);
+	*entry = dax_insert_entry(xas, vmf, *entry, pfn,
+			DAX_PMD | DAX_ZERO_PAGE, 0);

 	if (arch_needs_pgtable_deposit()) {
 		pgtable = pte_alloc_one(vma->vm_mm);
@@ -1459,6 +1468,7 @@ static vm_fault_t dax_fault_actor(struct vm_fault *vmf, pfn_t *pfnp,
 	bool write = vmf->flags & FAULT_FLAG_WRITE;
 	bool sync = dax_fault_is_synchronous(flags, vmf->vma, iomap);
 	unsigned long entry_flags = pmd ? DAX_PMD : 0;
+	unsigned int insert_flags = 0;
 	int err = 0;
 	pfn_t pfn;
 	void *kaddr;
@@ -1481,8 +1491,15 @@ static vm_fault_t dax_fault_actor(struct vm_fault *vmf, pfn_t *pfnp,
 	if (err)
 		return pmd ? VM_FAULT_FALLBACK : dax_fault_return(err);

-	*entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn, entry_flags,
-				  write && !sync);
+	if (write) {
+		if (!sync)
+			insert_flags |= DAX_IF_DIRTY;
+		if (iomap->flags & IOMAP_F_SHARED)
+			insert_flags |= DAX_IF_COW;
+	}
+
+	*entry = dax_insert_entry(xas, vmf, *entry, pfn, entry_flags,
+				  insert_flags);

 	if (write &&
 	    srcmap->addr != IOMAP_HOLE && srcmap->addr != iomap->addr) {
--
2.31.1
This patchset is an attempt to add CoW support for fsdax, taking XFS,
which has both the reflink and fsdax features, as an example.

Changes from V5:
 - Fix the lock order of xfs_inode in
   xfs_mmaplock_two_inodes_and_break_dax_layout()
 - Move dax_remap_file_range_prep() to fs/dax.c
 - Change the type of length to uint64_t in dax_iomap_cow_copy()
 - Fix a mistake in dax_iomap_zero()

Changes from V4:
 - Fix the mistake of breaking the dax layout for two inodes
 - Add a CONFIG_FS_DAX judgement for fsdax code in remap_range.c
 - Fix other small problems and mistakes

One of the key mechanisms that needs to be implemented in fsdax is CoW:
copy the data from the srcmap before we actually write data to the
destination iomap, and copy only the range in which data won't be
changed.

Another mechanism is range comparison. In the page cache case,
readpage() is used to load data on disk into the page cache in order to
be able to compare data. In the fsdax case, readpage() does not work,
so we need another way to compare data, with direct access support.

With the two mechanisms implemented in fsdax, we are able to make
reflink and fsdax work together in XFS.

Some of the patches are picked up from Goldwyn's patchset. I made some
changes to adapt them to this patchset.

(Rebased on v5.13-rc2 and patchset[1])
[1]: https://lkml.org/lkml/2021/4/22/575

Shiyang Ruan (7):
  fsdax: Introduce dax_iomap_cow_copy()
  fsdax: Replace mmap entry in case of CoW
  fsdax: Add dax_iomap_cow_copy() for dax_iomap_zero
  iomap: Introduce iomap_apply2() for operations on two files
  fsdax: Dedup file range to use a compare function
  fs/xfs: Handle CoW for fsdax write() path
  fs/xfs: Add dax dedupe support

 fs/dax.c               | 216 ++++++++++++++++++++++++++++++++++++-----
 fs/iomap/apply.c       |  52 ++++++++++
 fs/iomap/buffered-io.c |   2 +-
 fs/remap_range.c       |  36 +++++--
 fs/xfs/xfs_bmap_util.c |   3 +-
 fs/xfs/xfs_file.c      |  11 +--
 fs/xfs/xfs_inode.c     |  57 +++++++++++
 fs/xfs/xfs_inode.h     |   1 +
 fs/xfs/xfs_iomap.c     |  38 +++++++-
 fs/xfs/xfs_iomap.h     |  24 +++++
 fs/xfs/xfs_iops.c      |   7 +-
 fs/xfs/xfs_reflink.c   |  15 +--
 include/linux/dax.h    |  11 ++-
 include/linux/fs.h     |  12 ++-
 include/linux/iomap.h  |   7 +-
 15 files changed, 431 insertions(+), 61 deletions(-)
--
2.31.1
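As a reading aid, the write-path pieces of the series compose roughly as follows;
this is a condensed sketch in comment form using the function names introduced by
the patches above, with locking and error handling omitted:

    /*
     * fsdax CoW write path, condensed from the series:
     *
     * xfs_file_dax_write()
     *   dax_iomap_rw(iocb, from, &xfs_dax_write_iomap_ops)
     *     xfs_direct_write_iomap_begin()   // allocates a CoW fork extent;
     *                                      // srcmap != iomap when shared
     *     dax_iomap_actor()
     *       dax_iomap_cow_copy()           // copy unaligned head/tail from
     *                                      // srcmap into the new extent
     *       dax_copy_from_iter()           // the write itself
     *     xfs_dax_write_iomap_end()
     *       xfs_reflink_end_cow()          // remap new extent into the file
     */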
On Tue, 2021-05-18 at 18:49 -0700, Dan Williams wrote:
> On Tue, May 18, 2021 at 6:33 PM James Bottomley <jejb@linux.ibm.com>
> wrote:
> > On Tue, 2021-05-18 at 11:24 +0100, Mark Rutland wrote:
> > > On Thu, May 13, 2021 at 09:47:32PM +0300, Mike Rapoport wrote:
> > > > From: Mike Rapoport <rppt@linux.ibm.com>
> > > >
> > > > It is unsafe to allow saving of secretmem areas to the
> > > > hibernation snapshot as they would be visible after the resume
> > > > and this essentially will defeat the purpose of secret memory
> > > > mappings.
> > > >
> > > > Prevent hibernation whenever there are active secret memory
> > > > users.
> > >
> > > Have we thought about how this is going to work in practice, e.g.
> > > on mobile systems? It seems to me that there are a variety of
> > > common applications which might want to use this which people
> > > don't expect to inhibit hibernate (e.g. authentication agents,
> > > web browsers).
> >
> > If mobile systems require hibernate, then the choice is to disable
> > this functionality or implement a secure hibernation store. I
> > also thought most mobile hibernation was basically equivalent to
> > S3, in which case there's no actual writing of ram into storage, in
> > which case there's no security barrier and likely the inhibition
> > needs to be made a bit more specific to the suspend to disk case?
> >
> > > Are we happy to say that any userspace application can
> > > incidentally inhibit hibernate?
> >
> > Well, yes, for the laptop use case because we don't want suspend to
> > disk to be able to compromise the secret area. You can disable
> > this for mobile if you like, or work out how to implement hibernate
> > securely if you're really suspending to disk.
>
> Forgive me if this was already asked and answered. Why not document
> that secretmem is ephemeral in the case of hibernate and push the
> problem to userspace to disable hibernation? In other words
> hibernation causes applications to need to reload their secretmem, it
> will be destroyed on the way down and SIGBUS afterwards. That at
> least gives a system the flexibility to either sacrifice hibernate
> for secretmem (with a userspace controlled policy), or sacrifice
> secretmem using processes for hibernate.
Well, realistically, there are many possibilities for embedded if it
wants to use secret memory. However, not really having much of an
interest in the use cases, it's not really for Mike or me to be acting
as armchair fly half. I think the best we can do is demonstrate the
system for our use cases and let embedded kick the tyres for theirs if
they care, and if not they can disable the feature.
James
On Tue, May 18, 2021 at 05:44:22PM -0700, Andrew Morton wrote:
> On Tue, 18 May 2021 10:20:31 +0300 Mike Rapoport <rppt@kernel.org> wrote:
>
> > From: Mike Rapoport <rppt@linux.ibm.com>
> >
> > Introduce the "memfd_secret" system call, with the ability to create memory
> > areas visible only in the context of the owning process and not mapped, not
> > only to other processes, but in the kernel page tables as well.
> >
> > ...
> >
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -901,4 +901,9 @@ config KMAP_LOCAL
> >  # struct io_mapping based helper. Selected by drivers that need them
> >  config IO_MAPPING
> >  	bool
> > +
> > +config SECRETMEM
> > +	def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
> > +	select STRICT_DEVMEM
> > +
> >  endmenu
>
> WARNING: unmet direct dependencies detected for STRICT_DEVMEM
>   Depends on [n]: MMU [=y] && DEVMEM [=n] && (ARCH_HAS_DEVMEM_IS_ALLOWED [=y] || GENERIC_LIB_DEVMEM_IS_ALLOWED [=n])
>   Selected by [y]:
>   - SECRETMEM [=y]
>
> so I went back to the v19 version, with

Ouch, sorry, I forgot to remove that hunk, v19 is the correct version.

> --- a/mm/Kconfig~mm-introduce-memfd_secret-system-call-to-create-secret-memory-areas-fix
> +++ a/mm/Kconfig
> @@ -907,6 +907,5 @@ config IO_MAPPING
>
>  config SECRETMEM
>  	def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
> -	select STRICT_DEVMEM
>
>  endmenu
> _

--
Sincerely yours,
Mike.
On Tue, May 18, 2021 at 6:33 PM James Bottomley <jejb@linux.ibm.com> wrote:
>
> On Tue, 2021-05-18 at 11:24 +0100, Mark Rutland wrote:
> > On Thu, May 13, 2021 at 09:47:32PM +0300, Mike Rapoport wrote:
> > > From: Mike Rapoport <rppt@linux.ibm.com>
> > >
> > > It is unsafe to allow saving of secretmem areas to the hibernation
> > > snapshot as they would be visible after the resume and this
> > > essentially will defeat the purpose of secret memory mappings.
> > >
> > > Prevent hibernation whenever there are active secret memory users.
> >
> > Have we thought about how this is going to work in practice, e.g. on
> > mobile systems? It seems to me that there are a variety of common
> > applications which might want to use this which people don't expect
> > to inhibit hibernate (e.g. authentication agents, web browsers).
>
> If mobile systems require hibernate, then the choice is to disable this
> functionality or implement a secure hibernation store. I also thought
> most mobile hibernation was basically equivalent to S3, in which case
> there's no actual writing of ram into storage, in which case there's no
> security barrier and likely the inhibition needs to be made a bit more
> specific to the suspend to disk case?
>
> > Are we happy to say that any userspace application can incidentally
> > inhibit hibernate?
>
> Well, yes, for the laptop use case because we don't want suspend to
> disk to be able to compromise the secret area. You can disable this
> for mobile if you like, or work out how to implement hibernate securely
> if you're really suspending to disk.
Forgive me if this was already asked and answered. Why not document
that secretmem is ephemeral in the case of hibernate and push the
problem to userspace to disable hibernation? In other words
hibernation causes applications to need to reload their secretmem, it
will be destroyed on the way down and SIGBUS afterwards. That at least
gives a system the flexibility to either sacrifice hibernate for
secretmem (with a userspace controlled policy), or sacrifice secretmem
using processes for hibernate.
On Tue, 2021-05-18 at 11:24 +0100, Mark Rutland wrote:
> On Thu, May 13, 2021 at 09:47:32PM +0300, Mike Rapoport wrote:
> > From: Mike Rapoport <rppt@linux.ibm.com>
> >
> > It is unsafe to allow saving of secretmem areas to the hibernation
> > snapshot as they would be visible after the resume and this
> > essentially will defeat the purpose of secret memory mappings.
> >
> > Prevent hibernation whenever there are active secret memory users.
>
> Have we thought about how this is going to work in practice, e.g. on
> mobile systems? It seems to me that there are a variety of common
> applications which might want to use this which people don't expect
> to inhibit hibernate (e.g. authentication agents, web browsers).

If mobile systems require hibernate, then the choice is to disable this
functionality or implement a secure hibernation store. I also thought
most mobile hibernation was basically equivalent to S3, in which case
there's no actual writing of ram into storage, in which case there's no
security barrier and likely the inhibition needs to be made a bit more
specific to the suspend to disk case?

> Are we happy to say that any userspace application can incidentally
> inhibit hibernate?

Well, yes, for the laptop use case because we don't want suspend to
disk to be able to compromise the secret area. You can disable this
for mobile if you like, or work out how to implement hibernate securely
if you're really suspending to disk.

James
On Tue, 18 May 2021 10:20:31 +0300 Mike Rapoport <rppt@kernel.org> wrote:
> From: Mike Rapoport <rppt@linux.ibm.com>
>
> Introduce "memfd_secret" system call with the ability to create memory
> areas visible only in the context of the owning process and not mapped not
> only to other processes but in the kernel page tables as well.
>
> ...
>
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -901,4 +901,9 @@ config KMAP_LOCAL
> # struct io_mapping based helper. Selected by drivers that need them
> config IO_MAPPING
> bool
> +
> +config SECRETMEM
> + def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
> + select STRICT_DEVMEM
> +
> endmenu
WARNING: unmet direct dependencies detected for STRICT_DEVMEM
Depends on [n]: MMU [=y] && DEVMEM [=n] && (ARCH_HAS_DEVMEM_IS_ALLOWED [=y] || GENERIC_LIB_DEVMEM_IS_ALLOWED [=n])
Selected by [y]:
- SECRETMEM [=y]
so I went back to the v19 version, with
--- a/mm/Kconfig~mm-introduce-memfd_secret-system-call-to-create-secret-memory-areas-fix
+++ a/mm/Kconfig
@@ -907,6 +907,5 @@ config IO_MAPPING
config SECRETMEM
def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
- select STRICT_DEVMEM
endmenu
_
On Tue, May 18, 2021 at 3:26 PM Vishal Verma <vishal.l.verma@intel.com> wrote:
>
> The 'nvdimm' mailing list has moved from lists.01.org to
> lists.linux.dev. Update CONTRIBUTING.md and configure.ac to reflect
> this.
LGTM
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
The 'nvdimm' mailing list has moved from lists.01.org to
lists.linux.dev. Update CONTRIBUTING.md and configure.ac to reflect
this.

Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
---
 configure.ac    | 2 +-
 CONTRIBUTING.md | 7 ++++---
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/configure.ac b/configure.ac
index 5ec8d2f..dc39dbe 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2,7 +2,7 @@ AC_PREREQ(2.60)
 m4_include([version.m4])
 AC_INIT([ndctl],
 	GIT_VERSION,
-	[linux-nvdimm@lists.01.org],
+	[nvdimm@lists.linux.dev],
 	[ndctl],
 	[https://github.com/pmem/ndctl])
 AC_CONFIG_SRCDIR([ndctl/lib/libndctl.c])
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 4c29d31..4f4865d 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -6,13 +6,14 @@ The following is a set of guidelines that we adhere to, and request that
 contributors follow.

 1. The libnvdimm (kernel subsystem) and ndctl developers primarily use
-   the [linux-nvdimm](https://lists.01.org/postorius/lists/linux-nvdimm.lists.01.org/)
+   the [nvdimm](https://subspace.kernel.org/lists.linux.dev.html)
    mailing list for everything. It is recommended to send patches to
-   **```linux-nvdimm@lists.01.org```**
+   **```nvdimm@lists.linux.dev```**
+   An archive is available on [lore](https://lore.kernel.org/nvdimm/)

 1. Github [issues](https://github.com/pmem/ndctl/issues) are an acceptable
    way to report a problem, but if you just have a question,
-   [email](mailto:linux-nvdimm@lists.01.org) the above list.
+   [email](mailto:nvdimm@lists.linux.dev) the above list.

 1. We follow the Linux Kernel [Coding Style Guide][cs] as applicable.

base-commit: a2a6fda4d7e93044fca4c67870d2ff7e193d3cf1
prerequisite-patch-id: 8fc5baaf64b312b2459acea255740f79a23b76cd
--
2.31.1
On 5/18/2021 10:27 AM, Joao Martins wrote:
> [...]
> I am reconsidering part of the above. In general, yes, the meaning of devmap @align
> represents a slightly different variation of the device @align i.e. how the metadata is
> laid out **but** regardless of what kind of page table entries we use vmemmap.
>
> By using DEVMAP_PTE/PMD/PUD we might end up 1) duplicating what nvdimm/dax already
> validates in terms of allowed device @align values (i.e. PAGE_SIZE, PMD_SIZE and PUD_SIZE)
> 2) the geometry of metadata is very much tied to the value we pick to @align at namespace
> provisioning -- not the "align" we might use at mmap() perhaps that's what you referred
> above? -- and 3) the value of geometry actually derives from dax device @align because we
> will need to create compound pages representing a page size of @align value.
>
> Using your example above: you're saying that dax->align == 1G is mapped with DEVMAP_PTEs,
> in reality the vmemmap is populated with PMDs/PUDs page tables (depending on what archs
> decide to do at vmemmap_populate()) and uses base pages as its metadata regardless of what
> device @align. In reality what we want to convey in @geometry is not page table sizes, but
> just the page size used for the vmemmap of the dax device. Additionally, limiting its
> value might not be desirable... if tomorrow Linux for some arch supports dax/nvdimm
> devices with 4M align or 64K align, the value of @geometry will have to reflect the 4M to
> create compound pages of order 10 for the said vmemmap.
>
> I am going to wait until you finish reviewing the remaining four patches of this series,
> but maybe this is a simple misnomer (s/align/geometry/) with a comment but without
> DEVMAP_{PTE,PMD,PUD} enum part? Or perhaps its own struct with a value and enum a
> setter/getter to audit its value? Thoughts?
>
> Joao
Good points there.
My understanding is that dax->align conveys the size granularity used
when carving out a namespace; it is a geometry attribute loosely akin
to the sector size of a spinning disk. I tend to think that the device
pagesize has almost no relation to "align": it is possible to have a
1G "align" with a 4K pagesize, or vice versa. That is, with the advent
of compound page support, the two concepts can be separated entirely.

How about adding a new option to "ndctl create-namespace" that
describes the device creator's desired pagesize, and another parameter
that says whether that pagesize is fixed or may be split up? That way,
if the intention is to never split up a 2M pagesize, it would be
possible to save a lot of metadata space on the device.
thanks,
-jane
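[Editor's note: to make the @geometry idea above concrete, here is a
minimal C sketch with hypothetical names; this is not the posted
series. @geometry would simply be the page size used for the device's
vmemmap, and the compound page order falls out of it, e.g. a 4M
geometry on 4K base pages gives order 10, as noted above.

    #include <linux/mm.h>	/* get_order() */
    #include <linux/sizes.h>	/* SZ_2M and friends */

    /* Hypothetical: the vmemmap page size of a dev_pagemap, kept
     * separate from the dax device @align used at provisioning. */
    struct pgmap_geometry_sketch {
    	unsigned long geometry;	/* e.g. SZ_4K, SZ_2M, SZ_4M, SZ_1G */
    };

    /* Order of the compound pages backing the device memory:
     * 2M -> 9, 4M -> 10, 1G -> 18 (with 4K base pages). */
    static inline unsigned int
    pgmap_geometry_order(struct pgmap_geometry_sketch *p)
    {
    	return get_order(p->geometry);
    }

Keeping the value as a page size rather than a DEVMAP_{PTE,PMD,PUD}
enum sidesteps the concern above about new aligns such as 4M or 64K.]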
On Tue 18-05-21 12:35:36, David Hildenbrand wrote:
> [...]
> I'm certainly interested; it would mean that we actually want to try
> recovering from VM_FAULT_OOM in various cases, and as you state, we
> might have to supply more information to make that work reliably.

Or maybe we want to get rid of VM_FAULT_OOM altogether... But this is
really tangent to this discussion. The only relation is that this
would be another place to check when somebody wants to go that
direction.

> Having that said, I guess what we have here is just the same as when
> our process fails to allocate a generic page table in
> __handle_mm_fault(), when we fail p4d_alloc() and friends ...

From a quick look it is really similar, in the sense that it
effectively never happens, and if it does then it certainly does the
wrong thing. The point I was trying to make is that there is likely no
need to go that way. Fundamentally, not being able to manipulate the
direct map for the page fault sounds like what SIGBUS should be used
for. From my POV it is similar to ENOSPC when a filesystem cannot
allocate metadata on the storage.
--
Michal Hocko
SUSE Labs
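[Editor's note: a minimal sketch of the fault-path behavior Michal is
arguing for, with illustrative names; it mirrors the shape of the
secretmem fault path quoted below, not the merged code. VM_FAULT_OOM
is reserved for a genuine page allocation failure, while a direct-map
manipulation failure becomes SIGBUS instead of being routed through
vmf_error():

    static vm_fault_t secretmem_fault_sketch(struct vm_fault *vmf)
    {
    	struct page *page;

    	page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
    	if (!page)
    		return VM_FAULT_OOM;	/* a real allocation failure */

    	if (set_direct_map_invalid_noflush(page, 1)) {
    		put_page(page);
    		/* Cannot unmap from the direct map: fail this
    		 * mapping with SIGBUS, do not invoke the OOM killer. */
    		return VM_FAULT_SIGBUS;
    	}

    	vmf->page = page;
    	return 0;
    }

The two-argument set_direct_map_invalid_noflush() follows the
signature used in the series under review.]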
On 18.05.21 12:31, Michal Hocko wrote:
> On Tue 18-05-21 12:06:42, David Hildenbrand wrote:
>> On 18.05.21 11:59, Michal Hocko wrote:
>>> On Sun 16-05-21 10:29:24, Mike Rapoport wrote:
>>>> On Fri, May 14, 2021 at 11:25:43AM +0200, David Hildenbrand wrote:
>>> [...]
>>>>>> +	if (!page)
>>>>>> +		return VM_FAULT_OOM;
>>>>>> +
>>>>>> +	err = set_direct_map_invalid_noflush(page, 1);
>>>>>> +	if (err) {
>>>>>> +		put_page(page);
>>>>>> +		return vmf_error(err);
>>>>>
>>>>> Would we want to translate that to a proper VM_FAULT_..., which would most
>>>>> probably be VM_FAULT_OOM when we fail to allocate a pagetable?
>>>>
>>>> That's what vmf_error does, it translates -ESOMETHING to VM_FAULT_XYZ.
>>>
>>> I haven't read through the rest but this has just caught my attention.
>>> Is it really reasonable to trigger the oom killer when you cannot
>>> invalidate the direct mapping. From a quick look at the code it is quite
>>> unlikely to see ENOMEM from that path (it allocates small pages) but this
>>> can become quite subtle over time. Shouldn't this simply SIGBUS if it
>>> cannot manipulate the direct mapping regardless of the underlying reason
>>> for that?
>>>
>>
>> OTOH, it means our kernel zones are depleted, so we'd better reclaim somehow
>> ...
>
> Killing a userspace process seems like a bad way around that.
>
> Although I have to say openly that I am not a great fan of VM_FAULT_OOM
> in general. It is usually the wrong way to handle the failure, because
> it happens outside of the allocation context, so you lose all the
> details (e.g. allocation constraints, NUMA policy, etc.). Also,
> whenever there is ENOMEM, the allocation itself has already made sure
> that all reclaim attempts have been exhausted. Just consider an
> allocation with GFP_NOWAIT/NO_RETRY or similar failing and propagating
> ENOMEM up the call stack. Turning that into the OOM killer sounds like
> a bad idea to me. But that is a more general topic. I have tried to
> bring this up in the past but there was not much interest in fixing
> it, as it was not a pressing problem...
>
I'm certainly interested; it would mean that we actually want to try
recovering from VM_FAULT_OOM in various cases, and as you state, we
might have to supply more information to make that work reliably.
Having that said, I guess what we have here is just the same as when our
process fails to allocate a generic page table in __handle_mm_fault(),
when we fail p4d_alloc() and friends ...
--
Thanks,
David / dhildenb
On Tue 18-05-21 12:06:42, David Hildenbrand wrote:
> On 18.05.21 11:59, Michal Hocko wrote:
> > On Sun 16-05-21 10:29:24, Mike Rapoport wrote:
> > > On Fri, May 14, 2021 at 11:25:43AM +0200, David Hildenbrand wrote:
> > [...]
> > > > > +	if (!page)
> > > > > +		return VM_FAULT_OOM;
> > > > > +
> > > > > +	err = set_direct_map_invalid_noflush(page, 1);
> > > > > +	if (err) {
> > > > > +		put_page(page);
> > > > > +		return vmf_error(err);
> > > >
> > > > Would we want to translate that to a proper VM_FAULT_..., which would most
> > > > probably be VM_FAULT_OOM when we fail to allocate a pagetable?
> > >
> > > That's what vmf_error does, it translates -ESOMETHING to VM_FAULT_XYZ.
> >
> > I haven't read through the rest but this has just caught my attention.
> > Is it really reasonable to trigger the oom killer when you cannot
> > invalidate the direct mapping. From a quick look at the code it is quite
> > > unlikely to see ENOMEM from that path (it allocates small pages) but this
> > > can become quite subtle over time. Shouldn't this simply SIGBUS if it
> > cannot manipulate the direct mapping regardless of the underlying reason
> > for that?
> >
>
> OTOH, it means our kernel zones are depleted, so we'd better reclaim somehow
> ...
Killing a userspace process seems like a bad way around that.
Although I have to say openly that I am not a great fan of VM_FAULT_OOM
in general. It is usually the wrong way to handle the failure, because
it happens outside of the allocation context, so you lose all the
details (e.g. allocation constraints, NUMA policy, etc.). Also,
whenever there is ENOMEM, the allocation itself has already made sure
that all reclaim attempts have been exhausted. Just consider an
allocation with GFP_NOWAIT/NO_RETRY or similar failing and propagating
ENOMEM up the call stack. Turning that into the OOM killer sounds like
a bad idea to me. But that is a more general topic. I have tried to
bring this up in the past but there was not much interest in fixing
it, as it was not a pressing problem...
--
Michal Hocko
SUSE Labs
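[Editor's note: for reference, vmf_error() in include/linux/mm.h at
the time of this discussion is essentially the following, which is why
an ENOMEM coming back from the direct-map path becomes VM_FAULT_OOM
while every other errno already becomes SIGBUS:

    static inline vm_fault_t vmf_error(int err)
    {
    	if (err == -ENOMEM)
    		return VM_FAULT_OOM;
    	return VM_FAULT_SIGBUS;
    }

So Michal's suggestion amounts to not feeding the direct-map errno
through this helper at all and returning VM_FAULT_SIGBUS directly.]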
On 18.05.21 12:24, Mark Rutland wrote:
> On Thu, May 13, 2021 at 09:47:32PM +0300, Mike Rapoport wrote:
>> From: Mike Rapoport <rppt@linux.ibm.com>
>>
>> It is unsafe to allow saving of secretmem areas to the hibernation
>> snapshot as they would be visible after the resume and this essentially
>> will defeat the purpose of secret memory mappings.
>>
>> Prevent hibernation whenever there are active secret memory users.
>
> Have we thought about how this is going to work in practice, e.g. on
> mobile systems? It seems to me that there are a variety of common
> applications which might want to use this which people don't expect to
> inhibit hibernate (e.g. authentication agents, web browsers).
>
> Are we happy to say that any userspace application can incidentally
> inhibit hibernate?
It's worth noting that secretmem has to be explicitly enabled by the
admin to even work.
--
Thanks,
David / dhildenb