* [PATCH 0/3 v5] dax: some dax fixes and cleanups
@ 2015-04-07  8:33 ` Boaz Harrosh
  0 siblings, 0 replies; 34+ messages in thread
From: Boaz Harrosh @ 2015-04-07 8:33 UTC (permalink / raw)
To: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
    Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
    linux-fsdevel, Eryu Guan, Christoph Hellwig
Cc: Stable Tree

Hi Andrew

I finally had the time to beat up these fixes based on linux-next/akpm
and they look OK. I'm sending the two fix patches CC'd to stable@, plus
a version of patch-1 for the 4.0 kernel; the 4.1-rc kernel will need a
different patch.

It is your call if you want these in stable. It is a breakage in the
DAX code that went into 4.0, but I guess it will not have that many
users right at the get-go, so feel free to remove the CC: stable@.
(The old XIP code that this DAX replaced had all the same problems.)

[v5]
* A new patch-1 based on the linux-next/akpm branch, because mm/memory.c
  changed completely there. Also a 4.0 version of the same patch-1 if
  needed for stable@.

List of patches:
[PATCH 1/3] mm(v4.1): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP
[PATCH 2/3] dax: use pfn_mkwrite to update c/mtime + freeze
[PATCH 3/3] dax: Unify ext2/4_{dax,}_file_operations

All these patches are based on linux-next/akpm. I'm not sure how they
will interact with ext4-next, though.

[PATCH 1/3 @stable] mm(v4.0): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP
This patch is for a 4.0-based tree, if we decide to send [PATCH 2/3]
to stable.

[v4] dax: some dax fixes and cleanups
* First patch fixed according to Andrew's comments. Thanks Andrew.
  The 1st and 2nd patches can go into the current kernel, as they fix
  something that was merged this release.
* Added a new patch to fix up splice in the DAX case, and a cleanup.
  This one can wait for 4.1. (So can the first two, since hardly anyone
  uses DAX in production yet.)
* DAX freeze is not fixed yet, as we have more problems than I
  originally hoped for, as pointed out by Dave.
  (Just as a reference, I'm sending a NO-GOOD additional patch to show
  what is not good enough to do. It was the RFC of [v3].)
* Not re-posting the xfstest. Dave, please pick this up (it already
  found bugs in non-DAX filesystems).

[v3] dax: Fix mmap-write not updating c/mtime
* I'm re-posting the two DAX patches that fix the mmap-write-after-read
  problem with DAX. (No changes since [v2].)
* I'm also posting a 3rd RFC patch to address what Jan said about
  fs_freeze and making the mapping read-only. Jan, please review and
  see if this is what you meant.

[v2]
Jan Kara has pointed out that if we add the sb_start/end_pagefault pair
in the new pfn_mkwrite, we also fix another bug, where a user could
start writing to a page while the filesystem is frozen.

[v1]
The main problem is that current mm/memory.c will not call us with
page_mkwrite if we do not have an actual page mapping, which is what
DAX uses. The solution presented here introduces a new pfn_mkwrite to
solve this problem. Please see patch-2 for details.

I've been running with this patch for 4 months, on both HW and VMs,
with no apparent danger, but see patch-1 where I played it safe.

I am also posting an xfstest (080) that demonstrates this problem. I
believe some git operations (can't remember which) suffer from this
problem as well. Actually, Eryu Guan found that this test fails on some
other filesystems too.

Matthew, hi. I would love to have your ACK on these patches.

Thanks
Boaz

^ permalink raw reply	[flat|nested] 34+ messages in thread
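The failing scenario described above (mmap-read pages in a read-only PFN, then mmap-write to the same page) can be reproduced from user space. The following is a minimal sketch, not the posted xfstest: on a regular page-cache filesystem the write fault goes through page_mkwrite and timestamps get updated, which is exactly the callback a DAX (pfn-based) mapping was missing before this series.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Sketch of the mmap-write-after-read access pattern. Returns the
 * byte that was read before the write, or -1 on error. The helper
 * name is ours, for illustration only. */
static int mmap_read_then_write(const char *path, char new_byte)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    char *m = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (m == MAP_FAILED) {
        close(fd);
        return -1;
    }

    int first = m[0];   /* read fault: page mapped in read-only      */
    m[0] = new_byte;    /* write fault: pte must be upgraded to      */
                        /* writable; on DAX this needed pfn_mkwrite  */
    msync(m, 4096, MS_SYNC);
    munmap(m, 4096);
    close(fd);
    return first;
}
```

On a pre-fix DAX mount, the write above modifies the file without c/mtime ever being updated, since no page struct exists for page_mkwrite to be called on.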
* [PATCH 1/3] mm(v4.1): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP
  2015-04-07  8:33 ` [PATCH 0/3 v5] dax: some dax fixes and cleanups Boaz Harrosh
@ 2015-04-07  8:40 ` Boaz Harrosh
  2015-04-07  8:51 ` Boaz Harrosh
  ` (3 more replies)
  4 replies; 34+ messages in thread
From: Boaz Harrosh @ 2015-04-07 8:40 UTC (permalink / raw)
To: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
    Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
    linux-fsdevel, Eryu Guan, Christoph Hellwig
Cc: Stable Tree

[v2]
Based on linux-next/akpm [3dc4623]. For the v4.1 merge window.
Incorporated comments from Andrew and Kirill.

[v1]
This will allow an FS that uses VM_PFNMAP | VM_MIXEDMAP (no page
structs) to get notified when an access is a write to a read-only PFN.

This can happen if we mmap() a file, then first mmap-read from it to
page-in a read-only PFN, then mmap-write to the same page.

We need this functionality to fix a DAX bug where, in the scenario
above, we fail to set ctime/mtime even though we modified the file.
An xfstest is attached to this patchset that shows the failure and
the fix. (A DAX patch will follow.)

This functionality is extra important for us because, upon dirtying
of a pmem page, we also want to RDMA the page to a remote cluster
node.

We define a new pfn_mkwrite and do not reuse page_mkwrite because:
1 - The name ;-)
2 - But mainly because it would take a very long and tedious audit of
    all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP users, to make
    sure they would not now crash. For example, current DAX code
    (which this is for) would crash. If we wanted to reuse
    page_mkwrite, we would first need to patch all users so they do
    not crash on a NULL page, and only then enable this patch. But
    even if I did that I would not sleep so well at night. Adding a
    new vector is the safest thing to do, and is not that expensive:
    an extra pointer in a static function vector per driver. The new
    vector is also better for performance, because otherwise we would
    call every current handler only for it to check, find no page, do
    nothing, and return.

No need to call it from do_shared_fault because do_wp_page is called
to change pte permissions anyway.

CC: Matthew Wilcox <matthew.r.wilcox@intel.com>
CC: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
CC: Jan Kara <jack@suse.cz>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Hugh Dickins <hughd@google.com>
CC: Mel Gorman <mgorman@suse.de>
CC: linux-mm@kvack.org

Signed-off-by: Yigal Korman <yigal@plexistor.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 include/linux/mm.h |  3 +++
 mm/memory.c        | 35 +++++++++++++++++++++++++++++++----
 2 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index d584b95..70c47f2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -251,6 +251,9 @@ struct vm_operations_struct {
	 * writable, if an error is returned it will cause a SIGBUS */
	int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);

+	/* same as page_mkwrite when using VM_PFNMAP|VM_MIXEDMAP */
+	int (*pfn_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
+
	/* called by access_process_vm when get_user_pages() fails, typically
	 * for use by special VMAs that can switch between memory and hardware
	 */
diff --git a/mm/memory.c b/mm/memory.c
index 59f6268..6e8f3f6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1982,6 +1982,19 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page,
	return ret;
 }

+static int do_pfn_mkwrite(struct vm_area_struct *vma, unsigned long address)
+{
+	struct vm_fault vmf = {
+		.page = 0,
+		.pgoff = (((address & PAGE_MASK) - vma->vm_start)
+				>> PAGE_SHIFT) + vma->vm_pgoff,
+		.virtual_address = (void __user *)(address & PAGE_MASK),
+		.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE,
+	};
+
+	return vma->vm_ops->pfn_mkwrite(vma, &vmf);
+}
+
 /*
  * Handle write page faults for pages that can be reused in the current vma
  *
@@ -2259,14 +2272,28 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
	 * VM_PFNMAP VMA.
	 *
	 * We should not cow pages in a shared writeable mapping.
-	 * Just mark the pages writable as we can't do any dirty
-	 * accounting on raw pfn maps.
+	 * Just mark the pages writable and/or call ops->pfn_mkwrite.
	 */
	if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
-					(VM_WRITE|VM_SHARED))
+					(VM_WRITE|VM_SHARED)) {
+		if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) {
+			int ret;
+
+			pte_unmap_unlock(page_table, ptl);
+			ret = do_pfn_mkwrite(vma, address);
+			if (ret & VM_FAULT_ERROR)
+				return ret;
+			page_table = pte_offset_map_lock(mm, pmd,
+							 address, &ptl);
+			/* Did pfn_mkwrite already fixed up the pte */
+			if (!pte_same(*page_table, orig_pte)) {
+				pte_unmap_unlock(page_table, ptl);
+				return ret;
+			}
+		}
		return wp_page_reuse(mm, vma, address, page_table, ptl,
				     orig_pte, old_page, 0, 0);
-
+	}
	pte_unmap_unlock(page_table, ptl);
	return wp_page_copy(mm, vma, address, page_table, pmd,
			    orig_pte, old_page);
--
1.9.3

^ permalink raw reply related	[flat|nested] 34+ messages in thread
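The safety argument in point 2 above (introducing an optional new vector instead of auditing every existing page_mkwrite handler) can be modeled in user space. This is a sketch with hypothetical names, not kernel code: the dispatcher mirrors the `if (vma->vm_ops && vma->vm_ops->pfn_mkwrite)` check the patch adds to do_wp_page(), so legacy drivers that never set the new pointer are simply skipped rather than being handed a NULL `vmf->page` they were not written to handle.

```c
#include <stddef.h>

/* User-space model (not kernel code) of the optional-vector dispatch. */
struct vmf_model { void *page; };

struct vm_ops_model {
    int (*page_mkwrite)(struct vmf_model *vmf);
    int (*pfn_mkwrite)(struct vmf_model *vmf);  /* may be NULL */
};

/* Mirrors the new branch in do_wp_page(): only call into the driver
 * if it opted in by setting ->pfn_mkwrite. */
static int wp_pfn_shared_model(const struct vm_ops_model *ops,
                               struct vmf_model *vmf)
{
    if (ops && ops->pfn_mkwrite)
        return ops->pfn_mkwrite(vmf); /* FS can update c/mtime here  */
    return 0;                         /* legacy driver: just reuse pte */
}

/* A DAX-like driver that implements the new callback. */
static int dax_like_pfn_mkwrite(struct vmf_model *vmf)
{
    (void)vmf;
    return 1; /* pretend the inode was marked dirty */
}
```

Reusing page_mkwrite instead would force every existing VM_PFNMAP/VM_MIXEDMAP handler through this path first, which is exactly the audit the commit message wants to avoid.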
* Re: [PATCH 1/3] mm(v4.1): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP 2015-04-07 8:40 ` [PATCH 1/3] mm(v4.1): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP Boaz Harrosh @ 2015-04-07 8:51 ` Boaz Harrosh 2015-04-07 9:03 ` Kirill A. Shutemov ` (2 subsequent siblings) 3 siblings, 0 replies; 34+ messages in thread From: Boaz Harrosh @ 2015-04-07 8:51 UTC (permalink / raw) To: Boaz Harrosh, Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel, Eryu Guan, Christoph Hellwig Cc: Stable Tree On 04/07/2015 11:40 AM, Boaz Harrosh wrote: > Crap this is the wrong version I have a [v3] with some more of Kirill's comments fixes. Will resend Sorry for the noise Boaz > [v2] > Based on linux-next/akpm [3dc4623]. For v4.1 merge window > Incorporated comments from Andrew And Kirill > > [v1] > This will allow FS that uses VM_PFNMAP | VM_MIXEDMAP (no page structs) > to get notified when access is a write to a read-only PFN. > > This can happen if we mmap() a file then first mmap-read from it > to page-in a read-only PFN, than we mmap-write to the same page. > > We need this functionality to fix a DAX bug, where in the scenario > above we fail to set ctime/mtime though we modified the file. > An xfstest is attached to this patchset that shows the failure > and the fix. (A DAX patch will follow) > > This functionality is extra important for us, because upon > dirtying of a pmem page we also want to RDMA the page to a > remote cluster node. > > We define a new pfn_mkwrite and do not reuse page_mkwrite because > 1 - The name ;-) > 2 - But mainly because it would take a very long and tedious > audit of all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP > users. To make sure they do not now CRASH. For example current > DAX code (which this is for) would crash. > If we would want to reuse page_mkwrite, We will need to first > patch all users, so to not-crash-on-no-page. Then enable this > patch. 
But even if I did that I would not sleep so well at night. > Adding a new vector is the safest thing to do, and is not that > expensive. an extra pointer at a static function vector per driver. > Also the new vector is better for performance, because else we > Will call all current Kernel vectors, so to: > check-ha-no-page-do-nothing and return. > > No need to call it from do_shared_fault because do_wp_page is called to > change pte permissions anyway. > > CC: Matthew Wilcox <matthew.r.wilcox@intel.com> > CC: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> > CC: Jan Kara <jack@suse.cz> > CC: Andrew Morton <akpm@linux-foundation.org> > CC: Hugh Dickins <hughd@google.com> > CC: Mel Gorman <mgorman@suse.de> > CC: linux-mm@kvack.org > > Signed-off-by: Yigal Korman <yigal@plexistor.com> > Signed-off-by: Boaz Harrosh <boaz@plexistor.com> > --- > include/linux/mm.h | 3 +++ > mm/memory.c | 35 +++++++++++++++++++++++++++++++---- > 2 files changed, 34 insertions(+), 4 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index d584b95..70c47f2 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -251,6 +251,9 @@ struct vm_operations_struct { > * writable, if an error is returned it will cause a SIGBUS */ > int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf); > > + /* same as page_mkwrite when using VM_PFNMAP|VM_MIXEDMAP */ > + int (*pfn_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf); > + > /* called by access_process_vm when get_user_pages() fails, typically > * for use by special VMAs that can switch between memory and hardware > */ > diff --git a/mm/memory.c b/mm/memory.c > index 59f6268..6e8f3f6 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -1982,6 +1982,19 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page, > return ret; > } > > +static int do_pfn_mkwrite(struct vm_area_struct *vma, unsigned long address) > +{ > + struct vm_fault vmf = { > + .page = 0, > + .pgoff = (((address 
& PAGE_MASK) - vma->vm_start) > + >> PAGE_SHIFT) + vma->vm_pgoff, > + .virtual_address = (void __user *)(address & PAGE_MASK), > + .flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE, > + }; > + > + return vma->vm_ops->pfn_mkwrite(vma, &vmf); > +} > + > /* > * Handle write page faults for pages that can be reused in the current vma > * > @@ -2259,14 +2272,28 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, > * VM_PFNMAP VMA. > * > * We should not cow pages in a shared writeable mapping. > - * Just mark the pages writable as we can't do any dirty > - * accounting on raw pfn maps. > + * Just mark the pages writable and/or call ops->pfn_mkwrite. > */ > if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) == > - (VM_WRITE|VM_SHARED)) > + (VM_WRITE|VM_SHARED)) { > + if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) { > + int ret; > + > + pte_unmap_unlock(page_table, ptl); > + ret = do_pfn_mkwrite(vma, address); > + if (ret & VM_FAULT_ERROR) > + return ret; > + page_table = pte_offset_map_lock(mm, pmd, > + address, &ptl); > + /* Did pfn_mkwrite already fixed up the pte */ > + if (!pte_same(*page_table, orig_pte)) { > + pte_unmap_unlock(page_table, ptl); > + return ret; > + } > + } > return wp_page_reuse(mm, vma, address, page_table, ptl, > orig_pte, old_page, 0, 0); > - > + } > pte_unmap_unlock(page_table, ptl); > return wp_page_copy(mm, vma, address, page_table, pmd, > orig_pte, old_page); > ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/3] mm(v4.1): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP 2015-04-07 8:40 ` [PATCH 1/3] mm(v4.1): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP Boaz Harrosh 2015-04-07 8:51 ` Boaz Harrosh @ 2015-04-07 9:03 ` Kirill A. Shutemov 2015-04-07 9:22 ` Boaz Harrosh 2015-04-07 12:57 ` Boaz Harrosh 2015-04-07 14:06 ` Boaz Harrosh 3 siblings, 1 reply; 34+ messages in thread From: Kirill A. Shutemov @ 2015-04-07 9:03 UTC (permalink / raw) To: Boaz Harrosh Cc: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel, Eryu Guan, Christoph Hellwig, Stable Tree On Tue, Apr 07, 2015 at 11:40:06AM +0300, Boaz Harrosh wrote: > > [v2] > Based on linux-next/akpm [3dc4623]. For v4.1 merge window > Incorporated comments from Andrew And Kirill Not really. You've ignored most of them. See below. > [v1] > This will allow FS that uses VM_PFNMAP | VM_MIXEDMAP (no page structs) > to get notified when access is a write to a read-only PFN. > > This can happen if we mmap() a file then first mmap-read from it > to page-in a read-only PFN, than we mmap-write to the same page. > > We need this functionality to fix a DAX bug, where in the scenario > above we fail to set ctime/mtime though we modified the file. > An xfstest is attached to this patchset that shows the failure > and the fix. (A DAX patch will follow) > > This functionality is extra important for us, because upon > dirtying of a pmem page we also want to RDMA the page to a > remote cluster node. > > We define a new pfn_mkwrite and do not reuse page_mkwrite because > 1 - The name ;-) > 2 - But mainly because it would take a very long and tedious > audit of all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP > users. To make sure they do not now CRASH. For example current > DAX code (which this is for) would crash. > If we would want to reuse page_mkwrite, We will need to first > patch all users, so to not-crash-on-no-page. 
Then enable this > patch. But even if I did that I would not sleep so well at night. > Adding a new vector is the safest thing to do, and is not that > expensive. an extra pointer at a static function vector per driver. > Also the new vector is better for performance, because else we > Will call all current Kernel vectors, so to: > check-ha-no-page-do-nothing and return. > > No need to call it from do_shared_fault because do_wp_page is called to > change pte permissions anyway. > > CC: Matthew Wilcox <matthew.r.wilcox@intel.com> > CC: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> > CC: Jan Kara <jack@suse.cz> > CC: Andrew Morton <akpm@linux-foundation.org> > CC: Hugh Dickins <hughd@google.com> > CC: Mel Gorman <mgorman@suse.de> > CC: linux-mm@kvack.org > > Signed-off-by: Yigal Korman <yigal@plexistor.com> > Signed-off-by: Boaz Harrosh <boaz@plexistor.com> > --- > include/linux/mm.h | 3 +++ > mm/memory.c | 35 +++++++++++++++++++++++++++++++---- Please, document it in Documentation/filesystems/Locking. 
> 2 files changed, 34 insertions(+), 4 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index d584b95..70c47f2 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -251,6 +251,9 @@ struct vm_operations_struct { > * writable, if an error is returned it will cause a SIGBUS */ > int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf); > > + /* same as page_mkwrite when using VM_PFNMAP|VM_MIXEDMAP */ > + int (*pfn_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf); > + > /* called by access_process_vm when get_user_pages() fails, typically > * for use by special VMAs that can switch between memory and hardware > */ > diff --git a/mm/memory.c b/mm/memory.c > index 59f6268..6e8f3f6 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -1982,6 +1982,19 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page, > return ret; > } > > +static int do_pfn_mkwrite(struct vm_area_struct *vma, unsigned long address) > +{ > + struct vm_fault vmf = { > + .page = 0, .page = NULL, > + .pgoff = (((address & PAGE_MASK) - vma->vm_start) > + >> PAGE_SHIFT) + vma->vm_pgoff, .pgoff = linear_page_index(vma, address), > + .virtual_address = (void __user *)(address & PAGE_MASK), > + .flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE, > + }; > + > + return vma->vm_ops->pfn_mkwrite(vma, &vmf); > +} > + > /* > * Handle write page faults for pages that can be reused in the current vma > * > @@ -2259,14 +2272,28 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, > * VM_PFNMAP VMA. > * > * We should not cow pages in a shared writeable mapping. > - * Just mark the pages writable as we can't do any dirty > - * accounting on raw pfn maps. > + * Just mark the pages writable and/or call ops->pfn_mkwrite. > */ > if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) == > - (VM_WRITE|VM_SHARED)) > + (VM_WRITE|VM_SHARED)) { Let's move this case in separate function -- wp_pfn_shared(). As we do for wp_page_shared(). 
> + if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) { > + int ret; > + > + pte_unmap_unlock(page_table, ptl); > + ret = do_pfn_mkwrite(vma, address); > + if (ret & VM_FAULT_ERROR) > + return ret; > + page_table = pte_offset_map_lock(mm, pmd, > + address, &ptl); > + /* Did pfn_mkwrite already fixed up the pte */ > + if (!pte_same(*page_table, orig_pte)) { > + pte_unmap_unlock(page_table, ptl); > + return ret; > + } > + } > return wp_page_reuse(mm, vma, address, page_table, ptl, > orig_pte, old_page, 0, 0); > - > + } > pte_unmap_unlock(page_table, ptl); > return wp_page_copy(mm, vma, address, page_table, pmd, > orig_pte, old_page); -- Kirill A. Shutemov -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 34+ messages in thread
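Kirill's suggestion to replace the open-coded `.pgoff` expression with linear_page_index() is a pure simplification: for file mappings, linear_page_index() reduces to the same arithmetic, and the explicit PAGE_MASK is redundant because vm_start is always page aligned. A small user-space check (kernel types and helpers replaced with plain integers; macro names are ours) illustrates the equivalence:

```c
/* User-space model of the two pgoff computations in do_pfn_mkwrite(). */
#define MODEL_PAGE_SHIFT 12
#define MODEL_PAGE_SIZE  (1UL << MODEL_PAGE_SHIFT)
#define MODEL_PAGE_MASK  (~(MODEL_PAGE_SIZE - 1))

/* The open-coded expression from the posted patch. */
static unsigned long pgoff_open_coded(unsigned long address,
                                      unsigned long vm_start,
                                      unsigned long vm_pgoff)
{
    return (((address & MODEL_PAGE_MASK) - vm_start)
            >> MODEL_PAGE_SHIFT) + vm_pgoff;
}

/* What linear_page_index(vma, address) boils down to for file
 * mappings: the right shift already discards the in-page offset,
 * so no explicit masking is needed. */
static unsigned long pgoff_linear_page_index(unsigned long address,
                                             unsigned long vm_start,
                                             unsigned long vm_pgoff)
{
    return ((address - vm_start) >> MODEL_PAGE_SHIFT) + vm_pgoff;
}
```

Since both forms agree for any page-aligned vm_start, switching to the helper changes readability only, not behavior.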
* Re: [PATCH 1/3] mm(v4.1): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP 2015-04-07 9:03 ` Kirill A. Shutemov @ 2015-04-07 9:22 ` Boaz Harrosh 0 siblings, 0 replies; 34+ messages in thread From: Boaz Harrosh @ 2015-04-07 9:22 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel, Eryu Guan, Christoph Hellwig, Stable Tree On 04/07/2015 12:03 PM, Kirill A. Shutemov wrote: > On Tue, Apr 07, 2015 at 11:40:06AM +0300, Boaz Harrosh wrote: >> >> [v2] >> Based on linux-next/akpm [3dc4623]. For v4.1 merge window >> Incorporated comments from Andrew And Kirill > > Not really. You've ignored most of them. See below. > Yes sorry about that I sent the wrong version. <> >> --- >> include/linux/mm.h | 3 +++ >> mm/memory.c | 35 +++++++++++++++++++++++++++++++---- > > Please, document it in Documentation/filesystems/Locking. > Ha, I missed this one. Ok will try to put something coherent. <> >> diff --git a/mm/memory.c b/mm/memory.c >> index 59f6268..6e8f3f6 100644 >> --- a/mm/memory.c >> +++ b/mm/memory.c >> @@ -1982,6 +1982,19 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page, >> return ret; >> } >> >> +static int do_pfn_mkwrite(struct vm_area_struct *vma, unsigned long address) >> +{ >> + struct vm_fault vmf = { >> + .page = 0, > > .page = NULL, > >> + .pgoff = (((address & PAGE_MASK) - vma->vm_start) >> + >> PAGE_SHIFT) + vma->vm_pgoff, > > .pgoff = linear_page_index(vma, address), > Yes I had fixes for these two >> + .virtual_address = (void __user *)(address & PAGE_MASK), >> + .flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE, >> + }; >> + >> + return vma->vm_ops->pfn_mkwrite(vma, &vmf); >> +} >> + >> /* >> * Handle write page faults for pages that can be reused in the current vma >> * >> @@ -2259,14 +2272,28 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, >> * VM_PFNMAP VMA. 
>> * >> * We should not cow pages in a shared writeable mapping. >> - * Just mark the pages writable as we can't do any dirty >> - * accounting on raw pfn maps. >> + * Just mark the pages writable and/or call ops->pfn_mkwrite. >> */ >> if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) == >> - (VM_WRITE|VM_SHARED)) >> + (VM_WRITE|VM_SHARED)) { > > Let's move this case in separate function -- wp_pfn_shared(). As we do for > wp_page_shared(). > Ha, OK I will try that. I will need to re-run tests to make sure I did not mess up Thanks will fix, makes sense Boaz >> + if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) { >> + int ret; >> + >> + pte_unmap_unlock(page_table, ptl); >> + ret = do_pfn_mkwrite(vma, address); >> + if (ret & VM_FAULT_ERROR) >> + return ret; >> + page_table = pte_offset_map_lock(mm, pmd, >> + address, &ptl); >> + /* Did pfn_mkwrite already fixed up the pte */ >> + if (!pte_same(*page_table, orig_pte)) { >> + pte_unmap_unlock(page_table, ptl); >> + return ret; >> + } >> + } >> return wp_page_reuse(mm, vma, address, page_table, ptl, >> orig_pte, old_page, 0, 0); >> - >> + } >> pte_unmap_unlock(page_table, ptl); >> return wp_page_copy(mm, vma, address, page_table, pmd, >> orig_pte, old_page); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 34+ messages in thread
* [PATCH 1/3 v6] mm(v4.1): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP 2015-04-07 8:40 ` [PATCH 1/3] mm(v4.1): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP Boaz Harrosh @ 2015-04-07 12:57 ` Boaz Harrosh 2015-04-07 9:03 ` Kirill A. Shutemov ` (2 subsequent siblings) 3 siblings, 0 replies; 34+ messages in thread From: Boaz Harrosh @ 2015-04-07 12:57 UTC (permalink / raw) To: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel, Eryu Guan, Christoph Hellwig Cc: Stable Tree [v4] Kirill's comments about splitting out a new wp_pfn_shared(). Add Documentation/filesystems/Locking text about pfn_mkwrite. [v3] Kirill's comments about use of linear_page_index() [v2] Based on linux-next/akpm [3dc4623]. For v4.1 merge window Incorporated comments from Andrew and Kirill [v1] This will allow FSs that use VM_PFNMAP | VM_MIXEDMAP (no page structs) to get notified when an access is a write to a read-only PFN. This can happen if we mmap() a file, then first mmap-read from it to page-in a read-only PFN, then mmap-write to the same page. We need this functionality to fix a DAX bug, where in the scenario above we fail to set ctime/mtime even though we modified the file. An xfstest is attached to this patchset that shows the failure and the fix. (A DAX patch will follow) This functionality is extra important for us, because upon dirtying of a pmem page we also want to RDMA the page to a remote cluster node. We define a new pfn_mkwrite and do not reuse page_mkwrite because 1 - The name ;-) 2 - But mainly because it would take a very long and tedious audit of all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP users to make sure they do not now CRASH. For example, current DAX code (which this is for) would crash. If we wanted to reuse page_mkwrite, we would need to first patch all users so they do not crash on a missing page, and only then enable this patch. 
But even if I did that I would not sleep so well at night. Adding a new vector is the safest thing to do, and is not that expensive: an extra pointer in a static function vector per driver. The new vector is also better for performance, because otherwise we would call every current kernel page_mkwrite vector only to find there is no page and return. No need to call it from do_shared_fault because do_wp_page is called to change pte permissions anyway. CC: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> CC: Matthew Wilcox <matthew.r.wilcox@intel.com> CC: Jan Kara <jack@suse.cz> CC: Andrew Morton <akpm@linux-foundation.org> CC: Hugh Dickins <hughd@google.com> CC: Mel Gorman <mgorman@suse.de> CC: linux-mm@kvack.org Signed-off-by: Yigal Korman <yigal@plexistor.com> Signed-off-by: Boaz Harrosh <boaz@plexistor.com> --- Documentation/filesystems/Locking | 8 ++++++++ include/linux/mm.h | 3 +++ mm/memory.c | 40 +++++++++++++++++++++++++++++++++++---- 3 files changed, 47 insertions(+), 4 deletions(-) diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index f91926f..25f36e6 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking @@ -525,6 +525,7 @@ prototypes: void (*close)(struct vm_area_struct*); int (*fault)(struct vm_area_struct*, struct vm_fault *); int (*page_mkwrite)(struct vm_area_struct *, struct vm_fault *); + int (*pfn_mkwrite)(struct vm_area_struct *, struct vm_fault *); int (*access)(struct vm_area_struct *, unsigned long, void*, int, int); locking rules: @@ -534,6 +535,7 @@ close: yes fault: yes can return with page locked map_pages: yes page_mkwrite: yes can return with page locked +pfn_mkwrite: yes access: yes ->fault() is called when a previously not present pte is about @@ -560,6 +562,12 @@ the page has been truncated, the filesystem should not look up a new page like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which will cause the VM to retry the fault. 
+ ->pfn_mkwrite() is the same as page_mkwrite but when the pte is +VM_PFNMAP or VM_MIXEDMAP with a page-less entry. Expected return is +VM_FAULT_NOPAGE. Or one of the VM_FAULT_ERROR types. The default behavior +after this call is to make the pte read-write, unless pfn_mkwrite() +already touched the pte, in that case it is untouched. + ->access() is called when get_user_pages() fails in access_process_vm(), typically used to debug a process through /proc/pid/mem or ptrace. This function is needed only for diff --git a/include/linux/mm.h b/include/linux/mm.h index d584b95..70c47f2 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -251,6 +251,9 @@ struct vm_operations_struct { * writable, if an error is returned it will cause a SIGBUS */ int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf); + /* same as page_mkwrite when using VM_PFNMAP|VM_MIXEDMAP */ + int (*pfn_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf); + /* called by access_process_vm when get_user_pages() fails, typically * for use by special VMAs that can switch between memory and hardware */ diff --git a/mm/memory.c b/mm/memory.c index 59f6268..67b1ee8 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2181,6 +2181,39 @@ oom: return VM_FAULT_OOM; } +/* + * Handle write page faults for VM_MIXEDMAP or VM_PFNMAP for a VM_SHARED + * mapping + */ +static int wp_pfn_shared(struct mm_struct *mm, + struct vm_area_struct *vma, unsigned long address, + pte_t *page_table, spinlock_t *ptl, pte_t orig_pte, + pmd_t *pmd) +{ + if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) { + struct vm_fault vmf = { + .page = NULL, + .pgoff = linear_page_index(vma, address), + .virtual_address = (void __user *)(address & PAGE_MASK), + .flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE, + }; + int ret; + + pte_unmap_unlock(page_table, ptl); + ret = vma->vm_ops->pfn_mkwrite(vma, &vmf); + if (ret & VM_FAULT_ERROR) + return ret; + page_table = pte_offset_map_lock(mm, pmd, address, &ptl); + /* Did pfn_mkwrite 
already fixed up the pte */ + if (!pte_same(*page_table, orig_pte)) { + pte_unmap_unlock(page_table, ptl); + return ret; + } + } + return wp_page_reuse(mm, vma, address, page_table, ptl, orig_pte, + NULL, 0, 0); +} + static int wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, pte_t *page_table, pmd_t *pmd, spinlock_t *ptl, pte_t orig_pte, @@ -2259,13 +2292,12 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, * VM_PFNMAP VMA. * * We should not cow pages in a shared writeable mapping. - * Just mark the pages writable as we can't do any dirty - * accounting on raw pfn maps. + * Just mark the pages writable and/or call ops->pfn_mkwrite. */ if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) == (VM_WRITE|VM_SHARED)) - return wp_page_reuse(mm, vma, address, page_table, ptl, - orig_pte, old_page, 0, 0); + return wp_pfn_shared(mm, vma, address, page_table, ptl, + orig_pte, pmd); pte_unmap_unlock(page_table, ptl); return wp_page_copy(mm, vma, address, page_table, pmd, -- 1.9.3 ^ permalink raw reply related [flat|nested] 34+ messages in thread
* Re: [PATCH 1/3 v6] mm(v4.1): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP 2015-04-07 12:57 ` Boaz Harrosh (?) @ 2015-04-07 13:17 ` Kirill A. Shutemov 2015-04-07 13:26 ` Kirill A. Shutemov -1 siblings, 1 reply; 34+ messages in thread From: Kirill A. Shutemov @ 2015-04-07 13:17 UTC (permalink / raw) To: Boaz Harrosh Cc: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel, Eryu Guan, Christoph Hellwig, Stable Tree On Tue, Apr 07, 2015 at 03:57:32PM +0300, Boaz Harrosh wrote: > > [v4] > Kirill's comments about splitting out a new wp_pfn_shared(). > Add Documentation/filesystems/Locking text about pfn_mkwrite. > > [v3] > Kirill's comments about use of linear_page_index() > > [v2] > Based on linux-next/akpm [3dc4623]. For v4.1 merge window > Incorporated comments from Andrew And Kirill > > [v1] > This will allow FS that uses VM_PFNMAP | VM_MIXEDMAP (no page structs) > to get notified when access is a write to a read-only PFN. > > This can happen if we mmap() a file then first mmap-read from it > to page-in a read-only PFN, than we mmap-write to the same page. > > We need this functionality to fix a DAX bug, where in the scenario > above we fail to set ctime/mtime though we modified the file. > An xfstest is attached to this patchset that shows the failure > and the fix. (A DAX patch will follow) > > This functionality is extra important for us, because upon > dirtying of a pmem page we also want to RDMA the page to a > remote cluster node. > > We define a new pfn_mkwrite and do not reuse page_mkwrite because > 1 - The name ;-) > 2 - But mainly because it would take a very long and tedious > audit of all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP > users. To make sure they do not now CRASH. For example current > DAX code (which this is for) would crash. > If we would want to reuse page_mkwrite, We will need to first > patch all users, so to not-crash-on-no-page. 
Then enable this > patch. But even if I did that I would not sleep so well at night. > Adding a new vector is the safest thing to do, and is not that > expensive. an extra pointer at a static function vector per driver. > Also the new vector is better for performance, because else we > Will call all current Kernel vectors, so to: > check-ha-no-page-do-nothing and return. > > No need to call it from do_shared_fault because do_wp_page is called to > change pte permissions anyway. > > CC: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> > CC: Matthew Wilcox <matthew.r.wilcox@intel.com> > CC: Jan Kara <jack@suse.cz> > CC: Andrew Morton <akpm@linux-foundation.org> > CC: Hugh Dickins <hughd@google.com> > CC: Mel Gorman <mgorman@suse.de> > CC: linux-mm@kvack.org > > Signed-off-by: Yigal Korman <yigal@plexistor.com> > Signed-off-by: Boaz Harrosh <boaz@plexistor.com> > --- > Documentation/filesystems/Locking | 8 ++++++++ > include/linux/mm.h | 3 +++ > mm/memory.c | 40 +++++++++++++++++++++++++++++++++++---- > 3 files changed, 47 insertions(+), 4 deletions(-) > > diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking > index f91926f..25f36e6 100644 > --- a/Documentation/filesystems/Locking > +++ b/Documentation/filesystems/Locking > @@ -525,6 +525,7 @@ prototypes: > void (*close)(struct vm_area_struct*); > int (*fault)(struct vm_area_struct*, struct vm_fault *); > int (*page_mkwrite)(struct vm_area_struct *, struct vm_fault *); > + int (*pfn_mkwrite)(struct vm_area_struct *, struct vm_fault *); > int (*access)(struct vm_area_struct *, unsigned long, void*, int, int); > > locking rules: > @@ -534,6 +535,7 @@ close: yes > fault: yes can return with page locked > map_pages: yes > page_mkwrite: yes can return with page locked > +pfn_mkwrite: yes > access: yes > > ->fault() is called when a previously not present pte is about > @@ -560,6 +562,12 @@ the page has been truncated, the filesystem should not look up a new page > like the ->fault() 
handler, but simply return with VM_FAULT_NOPAGE, which > will cause the VM to retry the fault. > > + ->pfn_mkwrite() is the same as page_mkwrite but when the pte is > +VM_PFNMAP or VM_MIXEDMAP with a page-less entry. Expected return is > +VM_FAULT_NOPAGE. Or one of the VM_FAULT_ERROR types. The default behavior > +after this call is to make the pte read-write, unless pfn_mkwrite() > +already touched the pte, in that case it is untouched. > + > ->access() is called when get_user_pages() fails in > access_process_vm(), typically used to debug a process through > /proc/pid/mem or ptrace. This function is needed only for > diff --git a/include/linux/mm.h b/include/linux/mm.h > index d584b95..70c47f2 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -251,6 +251,9 @@ struct vm_operations_struct { > * writable, if an error is returned it will cause a SIGBUS */ > int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf); > > + /* same as page_mkwrite when using VM_PFNMAP|VM_MIXEDMAP */ > + int (*pfn_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf); > + > /* called by access_process_vm when get_user_pages() fails, typically > * for use by special VMAs that can switch between memory and hardware > */ > diff --git a/mm/memory.c b/mm/memory.c > index 59f6268..67b1ee8 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -2181,6 +2181,39 @@ oom: > return VM_FAULT_OOM; > } > > +/* > + * Handle write page faults for VM_MIXEDMAP or VM_PFNMAP for a VM_SHARED > + * mapping > + */ > +static int wp_pfn_shared(struct mm_struct *mm, > + struct vm_area_struct *vma, unsigned long address, > + pte_t *page_table, spinlock_t *ptl, pte_t orig_pte, > + pmd_t *pmd) > +{ > + if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) { > + struct vm_fault vmf = { > + .page = NULL, > + .pgoff = linear_page_index(vma, address), > + .virtual_address = (void __user *)(address & PAGE_MASK), > + .flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE, > + }; > + int ret; > + > + 
pte_unmap_unlock(page_table, ptl); > + ret = vma->vm_ops->pfn_mkwrite(vma, &vmf); > + if (ret & VM_FAULT_ERROR) > + return ret; > + page_table = pte_offset_map_lock(mm, pmd, address, &ptl); > + /* Did pfn_mkwrite already fixed up the pte */ > + if (!pte_same(*page_table, orig_pte)) { > + pte_unmap_unlock(page_table, ptl); > + return ret; This should be "return 0;", shouldn't it? VM_FAULT_NOPAGE would imply you've installed new pte, but you did not. > + } > + } > + return wp_page_reuse(mm, vma, address, page_table, ptl, orig_pte, > + NULL, 0, 0); > +} > + > static int wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma, > unsigned long address, pte_t *page_table, > pmd_t *pmd, spinlock_t *ptl, pte_t orig_pte, > @@ -2259,13 +2292,12 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, > * VM_PFNMAP VMA. > * > * We should not cow pages in a shared writeable mapping. > - * Just mark the pages writable as we can't do any dirty > - * accounting on raw pfn maps. > + * Just mark the pages writable and/or call ops->pfn_mkwrite. > */ > if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) == > (VM_WRITE|VM_SHARED)) > - return wp_page_reuse(mm, vma, address, page_table, ptl, > - orig_pte, old_page, 0, 0); > + return wp_pfn_shared(mm, vma, address, page_table, ptl, > + orig_pte, pmd); > > pte_unmap_unlock(page_table, ptl); > return wp_page_copy(mm, vma, address, page_table, pmd, > -- > 1.9.3 -- Kirill A. Shutemov ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/3 v6] mm(v4.1): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP 2015-04-07 13:17 ` Kirill A. Shutemov @ 2015-04-07 13:26 ` Kirill A. Shutemov 2015-04-07 13:37 ` Boaz Harrosh 0 siblings, 1 reply; 34+ messages in thread From: Kirill A. Shutemov @ 2015-04-07 13:26 UTC (permalink / raw) To: Boaz Harrosh Cc: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel, Eryu Guan, Christoph Hellwig, Stable Tree On Tue, Apr 07, 2015 at 04:17:00PM +0300, Kirill A. Shutemov wrote: > On Tue, Apr 07, 2015 at 03:57:32PM +0300, Boaz Harrosh wrote: > > +/* > > + * Handle write page faults for VM_MIXEDMAP or VM_PFNMAP for a VM_SHARED > > + * mapping > > + */ > > +static int wp_pfn_shared(struct mm_struct *mm, > > + struct vm_area_struct *vma, unsigned long address, > > + pte_t *page_table, spinlock_t *ptl, pte_t orig_pte, > > + pmd_t *pmd) > > +{ > > + if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) { > > + struct vm_fault vmf = { > > + .page = NULL, > > + .pgoff = linear_page_index(vma, address), > > + .virtual_address = (void __user *)(address & PAGE_MASK), > > + .flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE, > > + }; > > + int ret; > > + > > + pte_unmap_unlock(page_table, ptl); > > + ret = vma->vm_ops->pfn_mkwrite(vma, &vmf); > > + if (ret & VM_FAULT_ERROR) > > + return ret; > > + page_table = pte_offset_map_lock(mm, pmd, address, &ptl); > > + /* Did pfn_mkwrite already fixed up the pte */ Oh. I guess you've misunderstood why we need the pte_same() check below. It's not about ->pfn_mkwrite() changing the pte (generally, it should not). It's required to address a race with a parallel page fault on the pte. > > + if (!pte_same(*page_table, orig_pte)) { > > + pte_unmap_unlock(page_table, ptl); > > + return ret; > > This should be "return 0;", shouldn't it? > > VM_FAULT_NOPAGE would imply you've installed new pte, but you did not. -- Kirill A. 
Shutemov ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/3 v6] mm(v4.1): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP 2015-04-07 13:26 ` Kirill A. Shutemov @ 2015-04-07 13:37 ` Boaz Harrosh 0 siblings, 0 replies; 34+ messages in thread From: Boaz Harrosh @ 2015-04-07 13:37 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel, Eryu Guan, Christoph Hellwig, Stable Tree On 04/07/2015 04:26 PM, Kirill A. Shutemov wrote: > On Tue, Apr 07, 2015 at 04:17:00PM +0300, Kirill A. Shutemov wrote: >> On Tue, Apr 07, 2015 at 03:57:32PM +0300, Boaz Harrosh wrote: >>> +/* >>> + * Handle write page faults for VM_MIXEDMAP or VM_PFNMAP for a VM_SHARED >>> + * mapping >>> + */ >>> +static int wp_pfn_shared(struct mm_struct *mm, >>> + struct vm_area_struct *vma, unsigned long address, >>> + pte_t *page_table, spinlock_t *ptl, pte_t orig_pte, >>> + pmd_t *pmd) >>> +{ >>> + if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) { >>> + struct vm_fault vmf = { >>> + .page = NULL, >>> + .pgoff = linear_page_index(vma, address), >>> + .virtual_address = (void __user *)(address & PAGE_MASK), >>> + .flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE, >>> + }; >>> + int ret; >>> + >>> + pte_unmap_unlock(page_table, ptl); >>> + ret = vma->vm_ops->pfn_mkwrite(vma, &vmf); >>> + if (ret & VM_FAULT_ERROR) >>> + return ret; >>> + page_table = pte_offset_map_lock(mm, pmd, address, &ptl); >>> + /* Did pfn_mkwrite already fixed up the pte */ > > Oh. I guess you've missunderstood why we need pte_same() check below. > It's not about ->pfn_mkwrite() changing the pte (generatlly, it should > not). It's requited to address race with parallel page fault to the pte. > >>> + if (!pte_same(*page_table, orig_pte)) { >>> + pte_unmap_unlock(page_table, ptl); >>> + return ret; >> >> This should be "return 0;", shouldn't it? >> >> VM_FAULT_NOPAGE would imply you've installed new pte, but you did not. 
Changing this to "return 0" would be very scary for me, because I have been running with this code for half a year now, and it is stable. You see, since the original code it was always doing just that: pte_unmap_unlock && return ret. (See the patch based on 4.0) I did not understand if you want that I keep it "return ret". I gather that you would like the comment about the changed pte reworded, both here and at Documentation/.../locking. I'll send a new patch; just tell me if you want the return change. Thank you Boaz ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/3 v6] mm(v4.1): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP 2015-04-07 13:37 ` Boaz Harrosh (?) @ 2015-04-07 13:47 ` Kirill A. Shutemov 2015-04-07 14:09 ` Boaz Harrosh -1 siblings, 1 reply; 34+ messages in thread From: Kirill A. Shutemov @ 2015-04-07 13:47 UTC (permalink / raw) To: Boaz Harrosh Cc: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel, Eryu Guan, Christoph Hellwig, Stable Tree On Tue, Apr 07, 2015 at 04:37:07PM +0300, Boaz Harrosh wrote: > On 04/07/2015 04:26 PM, Kirill A. Shutemov wrote: > > On Tue, Apr 07, 2015 at 04:17:00PM +0300, Kirill A. Shutemov wrote: > >> On Tue, Apr 07, 2015 at 03:57:32PM +0300, Boaz Harrosh wrote: > >>> +/* > >>> + * Handle write page faults for VM_MIXEDMAP or VM_PFNMAP for a VM_SHARED > >>> + * mapping > >>> + */ > >>> +static int wp_pfn_shared(struct mm_struct *mm, > >>> + struct vm_area_struct *vma, unsigned long address, > >>> + pte_t *page_table, spinlock_t *ptl, pte_t orig_pte, > >>> + pmd_t *pmd) > >>> +{ > >>> + if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) { > >>> + struct vm_fault vmf = { > >>> + .page = NULL, > >>> + .pgoff = linear_page_index(vma, address), > >>> + .virtual_address = (void __user *)(address & PAGE_MASK), > >>> + .flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE, > >>> + }; > >>> + int ret; > >>> + > >>> + pte_unmap_unlock(page_table, ptl); > >>> + ret = vma->vm_ops->pfn_mkwrite(vma, &vmf); > >>> + if (ret & VM_FAULT_ERROR) > >>> + return ret; > >>> + page_table = pte_offset_map_lock(mm, pmd, address, &ptl); > >>> + /* Did pfn_mkwrite already fixed up the pte */ > > > > Oh. I guess you've missunderstood why we need pte_same() check below. > > It's not about ->pfn_mkwrite() changing the pte (generatlly, it should > > not). It's requited to address race with parallel page fault to the pte. 
> > > >>> + if (!pte_same(*page_table, orig_pte)) { > >>> + pte_unmap_unlock(page_table, ptl); > >>> + return ret; > >> > >> This should be "return 0;", shouldn't it? > >> > >> VM_FAULT_NOPAGE would imply you've installed new pte, but you did not. > > Changing this to "return 0" would be very scary for me. Because I'm running > with this code for 1/2 a year now. And it is stable. You see since the original > code it was always doing just that pte_unmap_unlock && return ret. (See the patch > based on 4.0) > > I did not understand if you want that I keep it "return ret". I think "return 0;" is the right way to go. It matches wp_page_shared() behaviour. > I gather that you would like the comment changed, about the changed pte. > Both here and at Documentation/.../locking. > > I'll send a new patch just tell me if you want the reurn thing > > Thank you > Boaz -- Kirill A. Shutemov ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/3 v6] mm(v4.1): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP 2015-04-07 13:47 ` Kirill A. Shutemov @ 2015-04-07 14:09 ` Boaz Harrosh 0 siblings, 0 replies; 34+ messages in thread From: Boaz Harrosh @ 2015-04-07 14:09 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel, Eryu Guan, Christoph Hellwig, Stable Tree On 04/07/2015 04:47 PM, Kirill A. Shutemov wrote: <> >> I did not understand if you want me to keep it "return ret". > > I think "return 0;" is the right way to go. It matches wp_page_shared() > behaviour. > OK, I sent a v7 with "return 0;" as it is in wp_page_shared(). I guess it is better. Of course it survived a quick xfstests run, but I'll let it smoke overnight as well. Thanks for all your help Boaz ^ permalink raw reply [flat|nested] 34+ messages in thread
* [PATCH 1/3 v7] mm(v4.1): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP 2015-04-07 8:40 ` [PATCH 1/3] mm(v4.1): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP Boaz Harrosh @ 2015-04-07 14:06 ` Boaz Harrosh 2015-04-07 9:03 ` Kirill A. Shutemov ` (2 subsequent siblings) 3 siblings, 0 replies; 34+ messages in thread From: Boaz Harrosh @ 2015-04-07 14:06 UTC (permalink / raw) To: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel, Eryu Guan, Christoph Hellwig Cc: Stable Tree [v5] Changed comments about the pte_same check after the call to pfn_mkwrite and the return value. [v4] Kirill's comments about splitting out a new wp_pfn_shared(). Add Documentation/filesystems/Locking text about pfn_mkwrite. [v3] Kirill's comments about use of linear_page_index() [v2] Based on linux-next/akpm [3dc4623]. For the v4.1 merge window Incorporated comments from Andrew and Kirill [v1] This will allow an FS that uses VM_PFNMAP | VM_MIXEDMAP (no page structs) to get notified when an access is a write to a read-only PFN. This can happen if we mmap() a file, then first mmap-read from it to page-in a read-only PFN, then mmap-write to the same page. We need this functionality to fix a DAX bug, where in the scenario above we fail to set ctime/mtime though we modified the file. An xfstest is attached to this patchset that shows the failure and the fix. (A DAX patch will follow) This functionality is extra important for us, because upon dirtying of a pmem page we also want to RDMA the page to a remote cluster node. We define a new pfn_mkwrite and do not reuse page_mkwrite because 1 - The name ;-) 2 - But mainly because it would take a very long and tedious audit of all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP users. To make sure they do not now CRASH. For example current DAX code (which this is for) would crash.
If we would want to reuse page_mkwrite, We will need to first patch all users, so to not-crash-on-no-page. Then enable this patch. But even if I did that I would not sleep so well at night. Adding a new vector is the safest thing to do, and is not that expensive. an extra pointer at a static function vector per driver. Also the new vector is better for performance, because else we Will call all current Kernel vectors, so to: check-ha-no-page-do-nothing and return. No need to call it from do_shared_fault because do_wp_page is called to change pte permissions anyway. CC: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> CC: Matthew Wilcox <matthew.r.wilcox@intel.com> CC: Jan Kara <jack@suse.cz> CC: Andrew Morton <akpm@linux-foundation.org> CC: Hugh Dickins <hughd@google.com> CC: Mel Gorman <mgorman@suse.de> CC: linux-mm@kvack.org Signed-off-by: Yigal Korman <yigal@plexistor.com> Signed-off-by: Boaz Harrosh <boaz@plexistor.com> --- Documentation/filesystems/Locking | 8 ++++++++ include/linux/mm.h | 3 +++ mm/memory.c | 43 +++++++++++++++++++++++++++++++++++---- 3 files changed, 50 insertions(+), 4 deletions(-) diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index f91926f..8bb8a7e 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking @@ -525,6 +525,7 @@ prototypes: void (*close)(struct vm_area_struct*); int (*fault)(struct vm_area_struct*, struct vm_fault *); int (*page_mkwrite)(struct vm_area_struct *, struct vm_fault *); + int (*pfn_mkwrite)(struct vm_area_struct *, struct vm_fault *); int (*access)(struct vm_area_struct *, unsigned long, void*, int, int); locking rules: @@ -534,6 +535,7 @@ close: yes fault: yes can return with page locked map_pages: yes page_mkwrite: yes can return with page locked +pfn_mkwrite: yes access: yes ->fault() is called when a previously not present pte is about @@ -560,6 +562,12 @@ the page has been truncated, the filesystem should not look up a new page like the 
->fault() handler, but simply return with VM_FAULT_NOPAGE, which will cause the VM to retry the fault. + ->pfn_mkwrite() is the same as page_mkwrite but when the pte is +VM_PFNMAP or VM_MIXEDMAP with a page-less entry. Expected return is +VM_FAULT_NOPAGE. Or one of the VM_FAULT_ERROR types. The default behavior +after this call is to make the pte read-write, unless pfn_mkwrite returns +an error. + ->access() is called when get_user_pages() fails in access_process_vm(), typically used to debug a process through /proc/pid/mem or ptrace. This function is needed only for diff --git a/include/linux/mm.h b/include/linux/mm.h index d584b95..70c47f2 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -251,6 +251,9 @@ struct vm_operations_struct { * writable, if an error is returned it will cause a SIGBUS */ int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf); + /* same as page_mkwrite when using VM_PFNMAP|VM_MIXEDMAP */ + int (*pfn_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf); + /* called by access_process_vm when get_user_pages() fails, typically * for use by special VMAs that can switch between memory and hardware */ diff --git a/mm/memory.c b/mm/memory.c index 59f6268..d839cbc 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2181,6 +2181,42 @@ oom: return VM_FAULT_OOM; } +/* + * Handle write page faults for VM_MIXEDMAP or VM_PFNMAP for a VM_SHARED + * mapping + */ +static int wp_pfn_shared(struct mm_struct *mm, + struct vm_area_struct *vma, unsigned long address, + pte_t *page_table, spinlock_t *ptl, pte_t orig_pte, + pmd_t *pmd) +{ + if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) { + struct vm_fault vmf = { + .page = NULL, + .pgoff = linear_page_index(vma, address), + .virtual_address = (void __user *)(address & PAGE_MASK), + .flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE, + }; + int ret; + + pte_unmap_unlock(page_table, ptl); + ret = vma->vm_ops->pfn_mkwrite(vma, &vmf); + if (ret & VM_FAULT_ERROR) + return ret; + page_table = 
pte_offset_map_lock(mm, pmd, address, &ptl); + /* + * We might have raced with another page fault while we + * released the pte_offset_map_lock. + */ + if (!pte_same(*page_table, orig_pte)) { + pte_unmap_unlock(page_table, ptl); + return 0; + } + } + return wp_page_reuse(mm, vma, address, page_table, ptl, orig_pte, + NULL, 0, 0); +} + static int wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, pte_t *page_table, pmd_t *pmd, spinlock_t *ptl, pte_t orig_pte, @@ -2259,13 +2295,12 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, * VM_PFNMAP VMA. * * We should not cow pages in a shared writeable mapping. - * Just mark the pages writable as we can't do any dirty - * accounting on raw pfn maps. + * Just mark the pages writable and/or call ops->pfn_mkwrite. */ if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) == (VM_WRITE|VM_SHARED)) - return wp_page_reuse(mm, vma, address, page_table, ptl, - orig_pte, old_page, 0, 0); + return wp_pfn_shared(mm, vma, address, page_table, ptl, + orig_pte, pmd); pte_unmap_unlock(page_table, ptl); return wp_page_copy(mm, vma, address, page_table, pmd, -- 1.9.3 ^ permalink raw reply related [flat|nested] 34+ messages in thread
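[Editor's note] The [v1] scenario above (mmap-read first to page-in a read-only pte, then mmap-write to the same page) can be sketched from userspace. This is an illustrative demo only, not part of the patchset: the file path and function name are invented, and whether the first fault really installs a read-only pte depends on the filesystem's dirty tracking. The comments map each access to the kernel path that wp_pfn_shared() hooks into:

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical demo: returns 0 when the mmap write round-trips to the file. */
int mmap_read_then_write(const char *path)
{
    int fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, 4096) < 0) {        /* one page of zeroes */
        close(fd);
        return -1;
    }

    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }

    /* Step 1 -- mmap-read: the first fault typically installs a read-only
     * pte even though the mapping is PROT_WRITE, so the kernel can catch
     * the first write separately. */
    volatile char first = p[0];
    (void)first;

    /* Step 2 -- mmap-write to the same page: this takes the write-protect
     * fault path (do_wp_page()); for a VM_PFNMAP/VM_MIXEDMAP mapping such
     * as DAX it is exactly where wp_pfn_shared() calls ->pfn_mkwrite(). */
    memcpy(p, "hello", 5);

    if (msync(p, 4096, MS_SYNC) < 0) {
        munmap(p, 4096);
        close(fd);
        return -1;
    }

    /* The write must be visible through the file descriptor. */
    char buf[6] = { 0 };
    ssize_t n = pread(fd, buf, 5, 0);

    munmap(p, 4096);
    close(fd);
    unlink(path);
    return (n == 5 && memcmp(buf, "hello", 5) == 0) ? 0 : -1;
}
```

On a DAX mapping before this series, step 2 reached do_wp_page() with no hook to notify the filesystem, which is why the file was modified without c/mtime ever being updated.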
* Re: [PATCH 1/3 v7] mm(v4.1): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP 2015-04-07 14:06 ` Boaz Harrosh (?) @ 2015-04-07 14:12 ` Kirill A. Shutemov -1 siblings, 0 replies; 34+ messages in thread From: Kirill A. Shutemov @ 2015-04-07 14:12 UTC (permalink / raw) To: Boaz Harrosh Cc: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel, Eryu Guan, Christoph Hellwig, Stable Tree On Tue, Apr 07, 2015 at 05:06:11PM +0300, Boaz Harrosh wrote: > > [v5] > Changed comments about pte_same check after the call to > pfn_mkwrite and the return value. > > [v4] > Kirill's comments about splitting out a new wp_pfn_shared(). > Add Documentation/filesystems/Locking text about pfn_mkwrite. > > [v3] > Kirill's comments about use of linear_page_index() > > [v2] > Based on linux-next/akpm [3dc4623]. For v4.1 merge window > Incorporated comments from Andrew And Kirill > > [v1] > This will allow FS that uses VM_PFNMAP | VM_MIXEDMAP (no page structs) > to get notified when access is a write to a read-only PFN. > > This can happen if we mmap() a file then first mmap-read from it > to page-in a read-only PFN, than we mmap-write to the same page. > > We need this functionality to fix a DAX bug, where in the scenario > above we fail to set ctime/mtime though we modified the file. > An xfstest is attached to this patchset that shows the failure > and the fix. (A DAX patch will follow) > > This functionality is extra important for us, because upon > dirtying of a pmem page we also want to RDMA the page to a > remote cluster node. > > We define a new pfn_mkwrite and do not reuse page_mkwrite because > 1 - The name ;-) > 2 - But mainly because it would take a very long and tedious > audit of all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP > users. To make sure they do not now CRASH. For example current > DAX code (which this is for) would crash. 
> If we would want to reuse page_mkwrite, We will need to first > patch all users, so to not-crash-on-no-page. Then enable this > patch. But even if I did that I would not sleep so well at night. > Adding a new vector is the safest thing to do, and is not that > expensive. an extra pointer at a static function vector per driver. > Also the new vector is better for performance, because else we > Will call all current Kernel vectors, so to: > check-ha-no-page-do-nothing and return. > > No need to call it from do_shared_fault because do_wp_page is called to > change pte permissions anyway. > > CC: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> > CC: Matthew Wilcox <matthew.r.wilcox@intel.com> > CC: Jan Kara <jack@suse.cz> > CC: Andrew Morton <akpm@linux-foundation.org> > CC: Hugh Dickins <hughd@google.com> > CC: Mel Gorman <mgorman@suse.de> > CC: linux-mm@kvack.org > > Signed-off-by: Yigal Korman <yigal@plexistor.com> > Signed-off-by: Boaz Harrosh <boaz@plexistor.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> -- Kirill A. Shutemov -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 34+ messages in thread
* [PATCH 2/3] dax: use pfn_mkwrite to update c/mtime + freeze protection 2015-04-07 8:33 ` Boaz Harrosh @ 2015-04-07 8:43 ` Boaz Harrosh -1 siblings, 0 replies; 34+ messages in thread From: Boaz Harrosh @ 2015-04-07 8:43 UTC (permalink / raw) To: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel, Eryu Guan, Christoph Hellwig Cc: Stable Tree From: Yigal Korman <yigal@plexistor.com> [v1] Without this patch, c/mtime is not updated correctly when mmap'ed page is first read from and then written to. A new xfstest is submitted for testing this (generic/080) [v2] Jan Kara has pointed out that if we add the sb_start/end_pagefault pair in the new pfn_mkwrite we are then fixing another bug where: A user could start writing to the page while filesystem is frozen. CC: Jan Kara <jack@suse.cz> CC: Matthew Wilcox <matthew.r.wilcox@intel.com> CC: Dave Chinner <david@fromorbit.com> CC: Stable Tree <stable@vger.kernel.org> Signed-off-by: Yigal Korman <yigal@plexistor.com> Signed-off-by: Boaz Harrosh <boaz@plexistor.com> --- fs/dax.c | 17 +++++++++++++++++ fs/ext2/file.c | 1 + fs/ext4/file.c | 1 + include/linux/fs.h | 1 + 4 files changed, 20 insertions(+) diff --git a/fs/dax.c b/fs/dax.c index ed1619e..d0bd1f4 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -464,6 +464,23 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, EXPORT_SYMBOL_GPL(dax_fault); /** + * dax_pfn_mkwrite - handle first write to DAX page + * @vma: The virtual memory area where the fault occurred + * @vmf: The description of the fault + * + */ +int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) +{ + struct super_block *sb = file_inode(vma->vm_file)->i_sb; + + sb_start_pagefault(sb); + file_update_time(vma->vm_file); + sb_end_pagefault(sb); + return VM_FAULT_NOPAGE; +} +EXPORT_SYMBOL_GPL(dax_pfn_mkwrite); + +/** * dax_zero_page_range - zero a range within a page of a DAX file * @inode: The file being 
truncated * @from: The file offset that is being truncated to diff --git a/fs/ext2/file.c b/fs/ext2/file.c index e317017..866a3ce 100644 --- a/fs/ext2/file.c +++ b/fs/ext2/file.c @@ -39,6 +39,7 @@ static int ext2_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) static const struct vm_operations_struct ext2_dax_vm_ops = { .fault = ext2_dax_fault, .page_mkwrite = ext2_dax_mkwrite, + .pfn_mkwrite = dax_pfn_mkwrite, }; static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma) diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 598abbb..aa78c70 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -206,6 +206,7 @@ static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) static const struct vm_operations_struct ext4_dax_vm_ops = { .fault = ext4_dax_fault, .page_mkwrite = ext4_dax_mkwrite, + .pfn_mkwrite = dax_pfn_mkwrite, }; #else #define ext4_dax_vm_ops ext4_file_vm_ops diff --git a/include/linux/fs.h b/include/linux/fs.h index 368e349..394035f 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2628,6 +2628,7 @@ int dax_clear_blocks(struct inode *, sector_t block, long size); int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t); int dax_truncate_page(struct inode *, loff_t from, get_block_t); int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t); +int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *); #define dax_mkwrite(vma, vmf, gb) dax_fault(vma, vmf, gb) #ifdef CONFIG_BLOCK -- 1.9.3 ^ permalink raw reply related [flat|nested] 34+ messages in thread
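[Editor's note] The generic/080 scenario this patch fixes can be stated as a small userspace check. A hedged sketch, not from the patchset: the path, helper name, and the 1000-second timestamp are invented for illustration, and whether st_mtime advances here depends on the filesystem's mkwrite handling (DAX before this patch left it stale), so the code only prints the result rather than asserting on it:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical demo: pin mtime in the past, mmap-read then mmap-write,
 * and report the mtime afterwards.  Returns 0 if the mechanics worked. */
int mmap_write_mtime_demo(const char *path)
{
    int fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, 4096) < 0) {
        close(fd);
        return -1;
    }

    /* Pin a/mtime to a known point in the past (t = 1000s after the epoch). */
    struct timespec past[2] = { { 1000, 0 }, { 1000, 0 } };
    if (futimens(fd, past) < 0) {
        close(fd);
        return -1;
    }

    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }

    volatile char c = p[0];  /* mmap-read first: pages in a read-only pte  */
    (void)c;
    p[0] = 'x';              /* then mmap-write: the generic/080 sequence  */
    msync(p, 4096, MS_SYNC);
    munmap(p, 4096);

    struct stat st;
    int ok = fstat(fd, &st) == 0;
    if (ok)
        /* Before this patch a DAX file kept st_mtime == 1000 here. */
        printf("mtime after mmap write: %ld\n", (long)st.st_mtime);

    close(fd);
    unlink(path);
    return ok ? 0 : -1;
}
```

With the dax_pfn_mkwrite() hook above wired into the fault path, the first write through the mapping calls file_update_time() inside an sb_start/end_pagefault pair, so the mtime moves forward and the write cannot slip past a filesystem freeze.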
* Re: [PATCH 2/3] dax: use pfn_mkwrite to update c/mtime + freeze protection 2015-04-07 8:43 ` Boaz Harrosh @ 2015-04-07 16:28 ` Jan Kara -1 siblings, 0 replies; 34+ messages in thread From: Jan Kara @ 2015-04-07 16:28 UTC (permalink / raw) To: Boaz Harrosh Cc: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel, Eryu Guan, Christoph Hellwig, Stable Tree On Tue 07-04-15 11:43:50, Boaz Harrosh wrote: > From: Yigal Korman <yigal@plexistor.com> > > [v1] > Without this patch, c/mtime is not updated correctly when mmap'ed page is > first read from and then written to. > > A new xfstest is submitted for testing this (generic/080) > > [v2] > Jan Kara has pointed out that if we add the > sb_start/end_pagefault pair in the new pfn_mkwrite we > are then fixing another bug where: A user could start > writing to the page while filesystem is frozen. The patch looks good to me. You can add: Reviewed-by: Jan Kara <jack@suse.cz> Honza > CC: Jan Kara <jack@suse.cz> > CC: Matthew Wilcox <matthew.r.wilcox@intel.com> > CC: Dave Chinner <david@fromorbit.com> > CC: Stable Tree <stable@vger.kernel.org> > > Signed-off-by: Yigal Korman <yigal@plexistor.com> > Signed-off-by: Boaz Harrosh <boaz@plexistor.com> > --- > fs/dax.c | 17 +++++++++++++++++ > fs/ext2/file.c | 1 + > fs/ext4/file.c | 1 + > include/linux/fs.h | 1 + > 4 files changed, 20 insertions(+) > > diff --git a/fs/dax.c b/fs/dax.c > index ed1619e..d0bd1f4 100644 > --- a/fs/dax.c > +++ b/fs/dax.c > @@ -464,6 +464,23 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, > EXPORT_SYMBOL_GPL(dax_fault); > > /** > + * dax_pfn_mkwrite - handle first write to DAX page > + * @vma: The virtual memory area where the fault occurred > + * @vmf: The description of the fault > + * > + */ > +int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) > +{ > + struct super_block *sb = file_inode(vma->vm_file)->i_sb; > + > + 
sb_start_pagefault(sb); > + file_update_time(vma->vm_file); > + sb_end_pagefault(sb); > + return VM_FAULT_NOPAGE; > +} > +EXPORT_SYMBOL_GPL(dax_pfn_mkwrite); > + > +/** > * dax_zero_page_range - zero a range within a page of a DAX file > * @inode: The file being truncated > * @from: The file offset that is being truncated to > diff --git a/fs/ext2/file.c b/fs/ext2/file.c > index e317017..866a3ce 100644 > --- a/fs/ext2/file.c > +++ b/fs/ext2/file.c > @@ -39,6 +39,7 @@ static int ext2_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) > static const struct vm_operations_struct ext2_dax_vm_ops = { > .fault = ext2_dax_fault, > .page_mkwrite = ext2_dax_mkwrite, > + .pfn_mkwrite = dax_pfn_mkwrite, > }; > > static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma) > diff --git a/fs/ext4/file.c b/fs/ext4/file.c > index 598abbb..aa78c70 100644 > --- a/fs/ext4/file.c > +++ b/fs/ext4/file.c > @@ -206,6 +206,7 @@ static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) > static const struct vm_operations_struct ext4_dax_vm_ops = { > .fault = ext4_dax_fault, > .page_mkwrite = ext4_dax_mkwrite, > + .pfn_mkwrite = dax_pfn_mkwrite, > }; > #else > #define ext4_dax_vm_ops ext4_file_vm_ops > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 368e349..394035f 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -2628,6 +2628,7 @@ int dax_clear_blocks(struct inode *, sector_t block, long size); > int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t); > int dax_truncate_page(struct inode *, loff_t from, get_block_t); > int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t); > +int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *); > #define dax_mkwrite(vma, vmf, gb) dax_fault(vma, vmf, gb) > > #ifdef CONFIG_BLOCK > -- > 1.9.3 > > -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 2/3] dax: use pfn_mkwrite to update c/mtime + freeze protection @ 2015-04-07 16:28 ` Jan Kara 0 siblings, 0 replies; 34+ messages in thread From: Jan Kara @ 2015-04-07 16:28 UTC (permalink / raw) To: Boaz Harrosh Cc: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel, Eryu Guan, Christoph Hellwig, Stable Tree On Tue 07-04-15 11:43:50, Boaz Harrosh wrote: > From: Yigal Korman <yigal@plexistor.com> > > [v1] > Without this patch, c/mtime is not updated correctly when mmap'ed page is > first read from and then written to. > > A new xfstest is submitted for testing this (generic/080) > > [v2] > Jan Kara has pointed out that if we add the > sb_start/end_pagefault pair in the new pfn_mkwrite we > are then fixing another bug where: A user could start > writing to the page while filesystem is frozen. The patch looks good to me. You can add: Reviewed-by: Jan Kara <jack@suse.cz> Honza > CC: Jan Kara <jack@suse.cz> > CC: Matthew Wilcox <matthew.r.wilcox@intel.com> > CC: Dave Chinner <david@fromorbit.com> > CC: Stable Tree <stable@vger.kernel.org> > > Signed-off-by: Yigal Korman <yigal@plexistor.com> > Signed-off-by: Boaz Harrosh <boaz@plexistor.com> > --- > fs/dax.c | 17 +++++++++++++++++ > fs/ext2/file.c | 1 + > fs/ext4/file.c | 1 + > include/linux/fs.h | 1 + > 4 files changed, 20 insertions(+) > > diff --git a/fs/dax.c b/fs/dax.c > index ed1619e..d0bd1f4 100644 > --- a/fs/dax.c > +++ b/fs/dax.c > @@ -464,6 +464,23 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, > EXPORT_SYMBOL_GPL(dax_fault); > > /** > + * dax_pfn_mkwrite - handle first write to DAX page > + * @vma: The virtual memory area where the fault occurred > + * @vmf: The description of the fault > + * > + */ > +int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) > +{ > + struct super_block *sb = file_inode(vma->vm_file)->i_sb; > + > + sb_start_pagefault(sb); > + 
file_update_time(vma->vm_file); > + sb_end_pagefault(sb); > + return VM_FAULT_NOPAGE; > +} > +EXPORT_SYMBOL_GPL(dax_pfn_mkwrite); > + > +/** > * dax_zero_page_range - zero a range within a page of a DAX file > * @inode: The file being truncated > * @from: The file offset that is being truncated to > diff --git a/fs/ext2/file.c b/fs/ext2/file.c > index e317017..866a3ce 100644 > --- a/fs/ext2/file.c > +++ b/fs/ext2/file.c > @@ -39,6 +39,7 @@ static int ext2_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) > static const struct vm_operations_struct ext2_dax_vm_ops = { > .fault = ext2_dax_fault, > .page_mkwrite = ext2_dax_mkwrite, > + .pfn_mkwrite = dax_pfn_mkwrite, > }; > > static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma) > diff --git a/fs/ext4/file.c b/fs/ext4/file.c > index 598abbb..aa78c70 100644 > --- a/fs/ext4/file.c > +++ b/fs/ext4/file.c > @@ -206,6 +206,7 @@ static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) > static const struct vm_operations_struct ext4_dax_vm_ops = { > .fault = ext4_dax_fault, > .page_mkwrite = ext4_dax_mkwrite, > + .pfn_mkwrite = dax_pfn_mkwrite, > }; > #else > #define ext4_dax_vm_ops ext4_file_vm_ops > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 368e349..394035f 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -2628,6 +2628,7 @@ int dax_clear_blocks(struct inode *, sector_t block, long size); > int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t); > int dax_truncate_page(struct inode *, loff_t from, get_block_t); > int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t); > +int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *); > #define dax_mkwrite(vma, vmf, gb) dax_fault(vma, vmf, gb) > > #ifdef CONFIG_BLOCK > -- > 1.9.3 > > -- Jan Kara <jack@suse.cz> SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 34+ messages in thread
* [PATCH 3/3] dax: Unify ext2/4_{dax,}_file_operations 2015-04-07 8:33 ` Boaz Harrosh @ 2015-04-07 8:45 ` Boaz Harrosh -1 siblings, 0 replies; 34+ messages in thread From: Boaz Harrosh @ 2015-04-07 8:45 UTC (permalink / raw) To: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel, Eryu Guan, Christoph Hellwig Cc: Stable Tree The original dax patchset split the ext2/4_file_operations because of the two NULL splice_read/splice_write in the dax case. At vfs if splice_read/splice_write are NULL would then call default_splice_read/write. What we do here is make generic_file_splice_read aware of IS_DAX() so the original ext2/4_file_operations can be used as is. For write it appears that iter_file_splice_write is just fine. It uses the regular f_op->write(file,..) or new_sync_write(file, ...). CC: Dave Chinner <dchinner@redhat.com> CC: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: Boaz Harrosh <boaz@plexistor.com> --- fs/ext2/ext2.h | 1 - fs/ext2/file.c | 18 ------------------ fs/ext2/inode.c | 5 +---- fs/ext2/namei.c | 10 ++-------- fs/ext4/ext4.h | 1 - fs/ext4/file.c | 20 -------------------- fs/ext4/inode.c | 5 +---- fs/ext4/namei.c | 10 ++-------- fs/splice.c | 3 +++ 9 files changed, 9 insertions(+), 64 deletions(-) diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h index 678f9ab..8d15feb 100644 --- a/fs/ext2/ext2.h +++ b/fs/ext2/ext2.h @@ -793,7 +793,6 @@ extern int ext2_fsync(struct file *file, loff_t start, loff_t end, int datasync); extern const struct inode_operations ext2_file_inode_operations; extern const struct file_operations ext2_file_operations; -extern const struct file_operations ext2_dax_file_operations; /* inode.c */ extern const struct address_space_operations ext2_aops; diff --git a/fs/ext2/file.c b/fs/ext2/file.c index 866a3ce..19cac93 100644 --- a/fs/ext2/file.c +++ b/fs/ext2/file.c @@ -109,24 +109,6 @@ const struct file_operations ext2_file_operations = { 
.splice_write = iter_file_splice_write, }; -#ifdef CONFIG_FS_DAX -const struct file_operations ext2_dax_file_operations = { - .llseek = generic_file_llseek, - .read = new_sync_read, - .write = new_sync_write, - .read_iter = generic_file_read_iter, - .write_iter = generic_file_write_iter, - .unlocked_ioctl = ext2_ioctl, -#ifdef CONFIG_COMPAT - .compat_ioctl = ext2_compat_ioctl, -#endif - .mmap = ext2_file_mmap, - .open = dquot_file_open, - .release = ext2_release_file, - .fsync = ext2_fsync, -}; -#endif - const struct inode_operations ext2_file_inode_operations = { #ifdef CONFIG_EXT2_FS_XATTR .setxattr = generic_setxattr, diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index df9d6af..b29eb67 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -1388,10 +1388,7 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino) if (S_ISREG(inode->i_mode)) { inode->i_op = &ext2_file_inode_operations; - if (test_opt(inode->i_sb, DAX)) { - inode->i_mapping->a_ops = &ext2_aops; - inode->i_fop = &ext2_dax_file_operations; - } else if (test_opt(inode->i_sb, NOBH)) { + if (test_opt(inode->i_sb, NOBH)) { inode->i_mapping->a_ops = &ext2_nobh_aops; inode->i_fop = &ext2_file_operations; } else { diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c index 148f6e3..ce42293 100644 --- a/fs/ext2/namei.c +++ b/fs/ext2/namei.c @@ -104,10 +104,7 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode return PTR_ERR(inode); inode->i_op = &ext2_file_inode_operations; - if (test_opt(inode->i_sb, DAX)) { - inode->i_mapping->a_ops = &ext2_aops; - inode->i_fop = &ext2_dax_file_operations; - } else if (test_opt(inode->i_sb, NOBH)) { + if (test_opt(inode->i_sb, NOBH)) { inode->i_mapping->a_ops = &ext2_nobh_aops; inode->i_fop = &ext2_file_operations; } else { @@ -125,10 +122,7 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode) return PTR_ERR(inode); inode->i_op = &ext2_file_inode_operations; - if (test_opt(inode->i_sb, DAX)) { - 
inode->i_mapping->a_ops = &ext2_aops; - inode->i_fop = &ext2_dax_file_operations; - } else if (test_opt(inode->i_sb, NOBH)) { + if (test_opt(inode->i_sb, NOBH)) { inode->i_mapping->a_ops = &ext2_nobh_aops; inode->i_fop = &ext2_file_operations; } else { diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index f63c3d5..8a3981e 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -2593,7 +2593,6 @@ extern const struct file_operations ext4_dir_operations; /* file.c */ extern const struct inode_operations ext4_file_inode_operations; extern const struct file_operations ext4_file_operations; -extern const struct file_operations ext4_dax_file_operations; extern loff_t ext4_llseek(struct file *file, loff_t offset, int origin); /* inline.c */ diff --git a/fs/ext4/file.c b/fs/ext4/file.c index aa78c70..e6d4280 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -625,26 +625,6 @@ const struct file_operations ext4_file_operations = { .fallocate = ext4_fallocate, }; -#ifdef CONFIG_FS_DAX -const struct file_operations ext4_dax_file_operations = { - .llseek = ext4_llseek, - .read = new_sync_read, - .write = new_sync_write, - .read_iter = generic_file_read_iter, - .write_iter = ext4_file_write_iter, - .unlocked_ioctl = ext4_ioctl, -#ifdef CONFIG_COMPAT - .compat_ioctl = ext4_compat_ioctl, -#endif - .mmap = ext4_file_mmap, - .open = ext4_file_open, - .release = ext4_release_file, - .fsync = ext4_sync_file, - /* Splice not yet supported with DAX */ - .fallocate = ext4_fallocate, -}; -#endif - const struct inode_operations ext4_file_inode_operations = { .setattr = ext4_setattr, .getattr = ext4_getattr, diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index a3f4513..035b7a0 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -4090,10 +4090,7 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino) if (S_ISREG(inode->i_mode)) { inode->i_op = &ext4_file_inode_operations; - if (test_opt(inode->i_sb, DAX)) - inode->i_fop = &ext4_dax_file_operations; - else - inode->i_fop = 
&ext4_file_operations; + inode->i_fop = &ext4_file_operations; ext4_set_aops(inode); } else if (S_ISDIR(inode->i_mode)) { inode->i_op = &ext4_dir_inode_operations; diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c index 28fe71a..2291923 100644 --- a/fs/ext4/namei.c +++ b/fs/ext4/namei.c @@ -2235,10 +2235,7 @@ retry: err = PTR_ERR(inode); if (!IS_ERR(inode)) { inode->i_op = &ext4_file_inode_operations; - if (test_opt(inode->i_sb, DAX)) - inode->i_fop = &ext4_dax_file_operations; - else - inode->i_fop = &ext4_file_operations; + inode->i_fop = &ext4_file_operations; ext4_set_aops(inode); err = ext4_add_nondir(handle, dentry, inode); if (!err && IS_DIRSYNC(dir)) @@ -2302,10 +2299,7 @@ retry: err = PTR_ERR(inode); if (!IS_ERR(inode)) { inode->i_op = &ext4_file_inode_operations; - if (test_opt(inode->i_sb, DAX)) - inode->i_fop = &ext4_dax_file_operations; - else - inode->i_fop = &ext4_file_operations; + inode->i_fop = &ext4_file_operations; ext4_set_aops(inode); d_tmpfile(dentry, inode); err = ext4_orphan_add(handle, inode); diff --git a/fs/splice.c b/fs/splice.c index 41cbb16..476024b 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -523,6 +523,9 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos, loff_t isize, left; int ret; + if (IS_DAX(in->f_mapping->host)) + return default_file_splice_read(in, ppos, pipe, len, flags); + isize = i_size_read(in->f_mapping->host); if (unlikely(*ppos >= isize)) return 0; -- 1.9.3 ^ permalink raw reply related [flat|nested] 34+ messages in thread
* Re: [PATCH 3/3] dax: Unify ext2/4_{dax,}_file_operations 2015-04-07 8:45 ` Boaz Harrosh (?) @ 2015-04-07 16:26 ` Jan Kara -1 siblings, 0 replies; 34+ messages in thread From: Jan Kara @ 2015-04-07 16:26 UTC (permalink / raw) To: Boaz Harrosh Cc: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel, Eryu Guan, Christoph Hellwig, Stable Tree On Tue 07-04-15 11:45:09, Boaz Harrosh wrote: > > The original dax patchset split the ext2/4_file_operations > because of the two NULL splice_read/splice_write in the dax > case. > > At vfs if splice_read/splice_write are NULL would then call > default_splice_read/write. > > What we do here is make generic_file_splice_read aware of > IS_DAX() so the original ext2/4_file_operations can be used > as is. > > For write it appears that iter_file_splice_write is just fine. > It uses the regular f_op->write(file,..) or new_sync_write(file, ...). > > CC: Dave Chinner <dchinner@redhat.com> > CC: Matthew Wilcox <willy@linux.intel.com> > Signed-off-by: Boaz Harrosh <boaz@plexistor.com> The patch looks good to me. 
You can add: Reviewed-by: Jan Kara <jack@suse.cz> Honza > --- > fs/ext2/ext2.h | 1 - > fs/ext2/file.c | 18 ------------------ > fs/ext2/inode.c | 5 +---- > fs/ext2/namei.c | 10 ++-------- > fs/ext4/ext4.h | 1 - > fs/ext4/file.c | 20 -------------------- > fs/ext4/inode.c | 5 +---- > fs/ext4/namei.c | 10 ++-------- > fs/splice.c | 3 +++ > 9 files changed, 9 insertions(+), 64 deletions(-) > > diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h > index 678f9ab..8d15feb 100644 > --- a/fs/ext2/ext2.h > +++ b/fs/ext2/ext2.h > @@ -793,7 +793,6 @@ extern int ext2_fsync(struct file *file, loff_t start, loff_t end, > int datasync); > extern const struct inode_operations ext2_file_inode_operations; > extern const struct file_operations ext2_file_operations; > -extern const struct file_operations ext2_dax_file_operations; > > /* inode.c */ > extern const struct address_space_operations ext2_aops; > diff --git a/fs/ext2/file.c b/fs/ext2/file.c > index 866a3ce..19cac93 100644 > --- a/fs/ext2/file.c > +++ b/fs/ext2/file.c > @@ -109,24 +109,6 @@ const struct file_operations ext2_file_operations = { > .splice_write = iter_file_splice_write, > }; > > -#ifdef CONFIG_FS_DAX > -const struct file_operations ext2_dax_file_operations = { > - .llseek = generic_file_llseek, > - .read = new_sync_read, > - .write = new_sync_write, > - .read_iter = generic_file_read_iter, > - .write_iter = generic_file_write_iter, > - .unlocked_ioctl = ext2_ioctl, > -#ifdef CONFIG_COMPAT > - .compat_ioctl = ext2_compat_ioctl, > -#endif > - .mmap = ext2_file_mmap, > - .open = dquot_file_open, > - .release = ext2_release_file, > - .fsync = ext2_fsync, > -}; > -#endif > - > const struct inode_operations ext2_file_inode_operations = { > #ifdef CONFIG_EXT2_FS_XATTR > .setxattr = generic_setxattr, > diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c > index df9d6af..b29eb67 100644 > --- a/fs/ext2/inode.c > +++ b/fs/ext2/inode.c > @@ -1388,10 +1388,7 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino) > > if 
(S_ISREG(inode->i_mode)) { > inode->i_op = &ext2_file_inode_operations; > - if (test_opt(inode->i_sb, DAX)) { > - inode->i_mapping->a_ops = &ext2_aops; > - inode->i_fop = &ext2_dax_file_operations; > - } else if (test_opt(inode->i_sb, NOBH)) { > + if (test_opt(inode->i_sb, NOBH)) { > inode->i_mapping->a_ops = &ext2_nobh_aops; > inode->i_fop = &ext2_file_operations; > } else { > diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c > index 148f6e3..ce42293 100644 > --- a/fs/ext2/namei.c > +++ b/fs/ext2/namei.c > @@ -104,10 +104,7 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode > return PTR_ERR(inode); > > inode->i_op = &ext2_file_inode_operations; > - if (test_opt(inode->i_sb, DAX)) { > - inode->i_mapping->a_ops = &ext2_aops; > - inode->i_fop = &ext2_dax_file_operations; > - } else if (test_opt(inode->i_sb, NOBH)) { > + if (test_opt(inode->i_sb, NOBH)) { > inode->i_mapping->a_ops = &ext2_nobh_aops; > inode->i_fop = &ext2_file_operations; > } else { > @@ -125,10 +122,7 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode) > return PTR_ERR(inode); > > inode->i_op = &ext2_file_inode_operations; > - if (test_opt(inode->i_sb, DAX)) { > - inode->i_mapping->a_ops = &ext2_aops; > - inode->i_fop = &ext2_dax_file_operations; > - } else if (test_opt(inode->i_sb, NOBH)) { > + if (test_opt(inode->i_sb, NOBH)) { > inode->i_mapping->a_ops = &ext2_nobh_aops; > inode->i_fop = &ext2_file_operations; > } else { > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h > index f63c3d5..8a3981e 100644 > --- a/fs/ext4/ext4.h > +++ b/fs/ext4/ext4.h > @@ -2593,7 +2593,6 @@ extern const struct file_operations ext4_dir_operations; > /* file.c */ > extern const struct inode_operations ext4_file_inode_operations; > extern const struct file_operations ext4_file_operations; > -extern const struct file_operations ext4_dax_file_operations; > extern loff_t ext4_llseek(struct file *file, loff_t offset, int origin); > > /* inline.c */ > diff --git 
a/fs/ext4/file.c b/fs/ext4/file.c > index aa78c70..e6d4280 100644 > --- a/fs/ext4/file.c > +++ b/fs/ext4/file.c > @@ -625,26 +625,6 @@ const struct file_operations ext4_file_operations = { > .fallocate = ext4_fallocate, > }; > > -#ifdef CONFIG_FS_DAX > -const struct file_operations ext4_dax_file_operations = { > - .llseek = ext4_llseek, > - .read = new_sync_read, > - .write = new_sync_write, > - .read_iter = generic_file_read_iter, > - .write_iter = ext4_file_write_iter, > - .unlocked_ioctl = ext4_ioctl, > -#ifdef CONFIG_COMPAT > - .compat_ioctl = ext4_compat_ioctl, > -#endif > - .mmap = ext4_file_mmap, > - .open = ext4_file_open, > - .release = ext4_release_file, > - .fsync = ext4_sync_file, > - /* Splice not yet supported with DAX */ > - .fallocate = ext4_fallocate, > -}; > -#endif > - > const struct inode_operations ext4_file_inode_operations = { > .setattr = ext4_setattr, > .getattr = ext4_getattr, > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index a3f4513..035b7a0 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -4090,10 +4090,7 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino) > > if (S_ISREG(inode->i_mode)) { > inode->i_op = &ext4_file_inode_operations; > - if (test_opt(inode->i_sb, DAX)) > - inode->i_fop = &ext4_dax_file_operations; > - else > - inode->i_fop = &ext4_file_operations; > + inode->i_fop = &ext4_file_operations; > ext4_set_aops(inode); > } else if (S_ISDIR(inode->i_mode)) { > inode->i_op = &ext4_dir_inode_operations; > diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c > index 28fe71a..2291923 100644 > --- a/fs/ext4/namei.c > +++ b/fs/ext4/namei.c > @@ -2235,10 +2235,7 @@ retry: > err = PTR_ERR(inode); > if (!IS_ERR(inode)) { > inode->i_op = &ext4_file_inode_operations; > - if (test_opt(inode->i_sb, DAX)) > - inode->i_fop = &ext4_dax_file_operations; > - else > - inode->i_fop = &ext4_file_operations; > + inode->i_fop = &ext4_file_operations; > ext4_set_aops(inode); > err = ext4_add_nondir(handle, dentry, inode); 
> if (!err && IS_DIRSYNC(dir)) > @@ -2302,10 +2299,7 @@ retry: > err = PTR_ERR(inode); > if (!IS_ERR(inode)) { > inode->i_op = &ext4_file_inode_operations; > - if (test_opt(inode->i_sb, DAX)) > - inode->i_fop = &ext4_dax_file_operations; > - else > - inode->i_fop = &ext4_file_operations; > + inode->i_fop = &ext4_file_operations; > ext4_set_aops(inode); > d_tmpfile(dentry, inode); > err = ext4_orphan_add(handle, inode); > diff --git a/fs/splice.c b/fs/splice.c > index 41cbb16..476024b 100644 > --- a/fs/splice.c > +++ b/fs/splice.c > @@ -523,6 +523,9 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos, > loff_t isize, left; > int ret; > > + if (IS_DAX(in->f_mapping->host)) > + return default_file_splice_read(in, ppos, pipe, len, flags); > + > isize = i_size_read(in->f_mapping->host); > if (unlikely(*ppos >= isize)) > return 0; > -- > 1.9.3 > > -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 34+ messages in thread
* [PATCH 1/3 @stable] mm(v4.0): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP 2015-04-07 8:33 ` Boaz Harrosh @ 2015-04-08 15:56 ` Boaz Harrosh -1 siblings, 0 replies; 34+ messages in thread From: Boaz Harrosh @ 2015-04-08 15:56 UTC (permalink / raw) To: Stable Tree Cc: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel, Eryu Guan, Christoph Hellwig From: Yigal Korman <yigal@plexistor.com> [For Stable 4.0.X] The parallel patch at 4.1-rc1 to this patch is: Subject: mm: new pfn_mkwrite same as page_mkwrite for VM_PFNMAP We need this patch for the 4.0.X stable tree if the patch Subject: dax: use pfn_mkwrite to update c/mtime + freeze protection Was decided to be pulled into stable since it is a dependency of this patch. The file mm/memory.c was heavily changed in 4.1 hence this here. [v3] In the case of !pte_same when we lost the race better return 0 instead of FAULT_NO_PAGE [v2] Fixed according to Kirill's comments [v1] This will allow FS that uses VM_PFNMAP | VM_MIXEDMAP (no page structs) to get notified when access is a write to a read-only PFN. This can happen if we mmap() a file then first mmap-read from it to page-in a read-only PFN, than we mmap-write to the same page. We need this functionality to fix a DAX bug, where in the scenario above we fail to set ctime/mtime though we modified the file. An xfstest is attached to this patchset that shows the failure and the fix. (A DAX patch will follow) This functionality is extra important for us, because upon dirtying of a pmem page we also want to RDMA the page to a remote cluster node. We define a new pfn_mkwrite and do not reuse page_mkwrite because 1 - The name ;-) 2 - But mainly because it would take a very long and tedious audit of all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP users. To make sure they do not now CRASH. For example current DAX code (which this is for) would crash. 
If we would want to reuse page_mkwrite, We will need to first patch all users, so to not-crash-on-no-page. Then enable this patch. But even if I did that I would not sleep so well at night. Adding a new vector is the safest thing to do, and is not that expensive. an extra pointer at a static function vector per driver. Also the new vector is better for performance, because else we Will call all current Kernel vectors, so to: check-ha-no-page-do-nothing and return. No need to call it from do_shared_fault because do_wp_page is called to change pte permissions anyway. Signed-off-by: Yigal Korman <yigal@plexistor.com> Signed-off-by: Boaz Harrosh <boaz@plexistor.com> CC: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> CC: Matthew Wilcox <matthew.r.wilcox@intel.com> CC: Jan Kara <jack@suse.cz> CC: Andrew Morton <akpm@linux-foundation.org> CC: Hugh Dickins <hughd@google.com> CC: Mel Gorman <mgorman@suse.de> CC: Konstantin Khlebnikov <koct9i@gmail.com> CC: linux-mm@kvack.org CC: Stable Tree <stable@vger.kernel.org> --- Documentation/filesystems/Locking | 8 ++++++++ include/linux/mm.h | 3 +++ mm/memory.c | 27 ++++++++++++++++++++++++++- 3 files changed, 37 insertions(+), 1 deletion(-) diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index f91926f..25f36e6 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking @@ -525,6 +525,7 @@ prototypes: void (*close)(struct vm_area_struct*); int (*fault)(struct vm_area_struct*, struct vm_fault *); int (*page_mkwrite)(struct vm_area_struct *, struct vm_fault *); + int (*pfn_mkwrite)(struct vm_area_struct *, struct vm_fault *); int (*access)(struct vm_area_struct *, unsigned long, void*, int, int); locking rules: @@ -534,6 +535,7 @@ close: yes fault: yes can return with page locked map_pages: yes page_mkwrite: yes can return with page locked +pfn_mkwrite: yes access: yes ->fault() is called when a previously not present pte is about @@ -560,6 +562,12 @@ the page has 
been truncated, the filesystem should not look up a new page like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which will cause the VM to retry the fault. + ->pfn_mkwrite() is the same as page_mkwrite but when the pte is +VM_PFNMAP or VM_MIXEDMAP with a page-less entry. Expected return is +VM_FAULT_NOPAGE. Or one of the VM_FAULT_ERROR types. The default behavior +after this call is to make the pte read-write, unless pfn_mkwrite() +already touched the pte, in that case it is untouched. + ->access() is called when get_user_pages() fails in access_process_vm(), typically used to debug a process through /proc/pid/mem or ptrace. This function is needed only for diff --git a/include/linux/mm.h b/include/linux/mm.h index 47a9392..85ba9c2 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -251,6 +251,9 @@ struct vm_operations_struct { * writable, if an error is returned it will cause a SIGBUS */ int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf); + /* same as page_mkwrite when using VM_PFNMAP|VM_MIXEDMAP */ + int (*pfn_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf); + /* called by access_process_vm when get_user_pages() fails, typically * for use by special VMAs that can switch between memory and hardware */ diff --git a/mm/memory.c b/mm/memory.c index 97839f5..6029777 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1982,6 +1982,18 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page, return ret; } +static int do_pfn_mkwrite(struct vm_area_struct *vma, unsigned long address) +{ + struct vm_fault vmf = { + .page = NULL, + .pgoff = linear_page_index(vma, address), + .virtual_address = (void __user *)(address & PAGE_MASK), + .flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE, + }; + + return vma->vm_ops->pfn_mkwrite(vma, &vmf); +} + /* * This routine handles present pages, when users try to write * to a shared page. 
It is done by copying the page to a new address @@ -2025,8 +2037,21 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, * accounting on raw pfn maps. */ if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) == - (VM_WRITE|VM_SHARED)) + (VM_WRITE|VM_SHARED)) { + if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) { + pte_unmap_unlock(page_table, ptl); + ret = do_pfn_mkwrite(vma, address); + if (ret & VM_FAULT_ERROR) + return ret; + page_table = pte_offset_map_lock(mm, pmd, + address, &ptl); + if (!pte_same(*page_table, orig_pte)) { + ret = 0; + goto unlock; + } + } goto reuse; + } goto gotten; } -- 1.9.3 ^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH 1/3 @stable] mm(v4.0): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP @ 2015-04-08 15:56 ` Boaz Harrosh 0 siblings, 0 replies; 34+ messages in thread From: Boaz Harrosh @ 2015-04-08 15:56 UTC (permalink / raw) To: Stable Tree Cc: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel, Eryu Guan, Christoph Hellwig From: Yigal Korman <yigal@plexistor.com> [For Stable 4.0.X] The parallel patch at 4.1-rc1 to this patch is: Subject: mm: new pfn_mkwrite same as page_mkwrite for VM_PFNMAP We need this patch in the 4.0.X stable tree if the patch Subject: dax: use pfn_mkwrite to update c/mtime + freeze protection is pulled into stable, since this patch is a dependency of that one. The file mm/memory.c was heavily changed in 4.1, hence this separate version. [v3] In the !pte_same case, when we lost the race, it is better to return 0 instead of VM_FAULT_NOPAGE [v2] Fixed according to Kirill's comments [v1] This will allow an FS that uses VM_PFNMAP | VM_MIXEDMAP (no page structs) to get notified when an access is a write to a read-only PFN. This can happen if we mmap() a file, then first mmap-read from it to page-in a read-only PFN, then mmap-write to the same page. We need this functionality to fix a DAX bug, where in the scenario above we fail to set ctime/mtime though we modified the file. An xfstest is attached to this patchset that shows the failure and the fix. (A DAX patch will follow) This functionality is extra important for us, because upon dirtying of a pmem page we also want to RDMA the page to a remote cluster node. We define a new pfn_mkwrite and do not reuse page_mkwrite because: 1 - The name ;-) 2 - But mainly because it would take a very long and tedious audit of all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP users to make sure they do not now CRASH. For example, the current DAX code (which this is for) would crash.
If we wanted to reuse page_mkwrite, we would need to first patch all users so they do not crash on no-page, and only then enable this patch. But even if I did that I would not sleep so well at night. Adding a new vector is the safest thing to do, and is not that expensive: an extra pointer in a static function vector per driver. The new vector is also better for performance, because otherwise we would call every current kernel vector just so it can check, find no page, do nothing, and return. No need to call it from do_shared_fault because do_wp_page is called to change pte permissions anyway. Signed-off-by: Yigal Korman <yigal@plexistor.com> Signed-off-by: Boaz Harrosh <boaz@plexistor.com> CC: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> CC: Matthew Wilcox <matthew.r.wilcox@intel.com> CC: Jan Kara <jack@suse.cz> CC: Andrew Morton <akpm@linux-foundation.org> CC: Hugh Dickins <hughd@google.com> CC: Mel Gorman <mgorman@suse.de> CC: Konstantin Khlebnikov <koct9i@gmail.com> CC: linux-mm@kvack.org CC: Stable Tree <stable@vger.kernel.org> --- Documentation/filesystems/Locking | 8 ++++++++ include/linux/mm.h | 3 +++ mm/memory.c | 27 ++++++++++++++++++++++++++- 3 files changed, 37 insertions(+), 1 deletion(-) diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index f91926f..25f36e6 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking @@ -525,6 +525,7 @@ prototypes: void (*close)(struct vm_area_struct*); int (*fault)(struct vm_area_struct*, struct vm_fault *); int (*page_mkwrite)(struct vm_area_struct *, struct vm_fault *); + int (*pfn_mkwrite)(struct vm_area_struct *, struct vm_fault *); int (*access)(struct vm_area_struct *, unsigned long, void*, int, int); locking rules: @@ -534,6 +535,7 @@ close: yes fault: yes can return with page locked map_pages: yes page_mkwrite: yes can return with page locked +pfn_mkwrite: yes access: yes ->fault() is called when a previously not present pte is about @@ -560,6 +562,12 @@ the page has
been truncated, the filesystem should not look up a new page like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which will cause the VM to retry the fault. + ->pfn_mkwrite() is the same as page_mkwrite but when the pte is +VM_PFNMAP or VM_MIXEDMAP with a page-less entry. Expected return is +VM_FAULT_NOPAGE. Or one of the VM_FAULT_ERROR types. The default behavior +after this call is to make the pte read-write, unless pfn_mkwrite() +already touched the pte, in that case it is untouched. + ->access() is called when get_user_pages() fails in access_process_vm(), typically used to debug a process through /proc/pid/mem or ptrace. This function is needed only for diff --git a/include/linux/mm.h b/include/linux/mm.h index 47a9392..85ba9c2 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -251,6 +251,9 @@ struct vm_operations_struct { * writable, if an error is returned it will cause a SIGBUS */ int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf); + /* same as page_mkwrite when using VM_PFNMAP|VM_MIXEDMAP */ + int (*pfn_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf); + /* called by access_process_vm when get_user_pages() fails, typically * for use by special VMAs that can switch between memory and hardware */ diff --git a/mm/memory.c b/mm/memory.c index 97839f5..6029777 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1982,6 +1982,18 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page, return ret; } +static int do_pfn_mkwrite(struct vm_area_struct *vma, unsigned long address) +{ + struct vm_fault vmf = { + .page = NULL, + .pgoff = linear_page_index(vma, address), + .virtual_address = (void __user *)(address & PAGE_MASK), + .flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE, + }; + + return vma->vm_ops->pfn_mkwrite(vma, &vmf); +} + /* * This routine handles present pages, when users try to write * to a shared page. 
It is done by copying the page to a new address @@ -2025,8 +2037,21 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, * accounting on raw pfn maps. */ if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) == - (VM_WRITE|VM_SHARED)) + (VM_WRITE|VM_SHARED)) { + if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) { + pte_unmap_unlock(page_table, ptl); + ret = do_pfn_mkwrite(vma, address); + if (ret & VM_FAULT_ERROR) + return ret; + page_table = pte_offset_map_lock(mm, pmd, + address, &ptl); + if (!pte_same(*page_table, orig_pte)) { + ret = 0; + goto unlock; + } + } goto reuse; + } goto gotten; } -- 1.9.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 34+ messages in thread
* Re: [PATCH 1/3 @stable] mm(v4.0): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP 2015-04-08 15:56 ` Boaz Harrosh @ 2015-04-08 16:00 ` Boaz Harrosh -1 siblings, 0 replies; 34+ messages in thread From: Boaz Harrosh @ 2015-04-08 16:00 UTC (permalink / raw) To: Stable Tree Cc: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel, Eryu Guan, Christoph Hellwig On 04/08/2015 06:56 PM, Boaz Harrosh wrote: > From: Yigal Korman <yigal@plexistor.com> > > [For Stable 4.0.X] > The parallel patch at 4.1-rc1 to this patch is: > Subject: mm: new pfn_mkwrite same as page_mkwrite for VM_PFNMAP > > We need this patch for the 4.0.X stable tree if the patch > Subject: dax: use pfn_mkwrite to update c/mtime + freeze protection > > Was decided to be pulled into stable since it is a dependency > of this patch. The file mm/memory.c was heavily changed in 4.1 > hence this here. > I forgot to send this patch to the stable tree; it is for 4.0 only. Again, this one is only needed if we are trying to pull Subject: dax: use pfn_mkwrite to update c/mtime + freeze protection which has the Stable@ tag. The problem it fixes is minor and the patch may be skipped if it causes problems. Thanks Boaz > [v3] > In the case of !pte_same when we lost the race better > return 0 instead of FAULT_NO_PAGE > > [v2] > Fixed according to Kirill's comments > > [v1] > This will allow FS that uses VM_PFNMAP | VM_MIXEDMAP (no page structs) > to get notified when access is a write to a read-only PFN. > > This can happen if we mmap() a file then first mmap-read from it > to page-in a read-only PFN, than we mmap-write to the same page. > > We need this functionality to fix a DAX bug, where in the scenario > above we fail to set ctime/mtime though we modified the file. > An xfstest is attached to this patchset that shows the failure > and the fix. 
(A DAX patch will follow) > > This functionality is extra important for us, because upon > dirtying of a pmem page we also want to RDMA the page to a > remote cluster node. > > We define a new pfn_mkwrite and do not reuse page_mkwrite because > 1 - The name ;-) > 2 - But mainly because it would take a very long and tedious > audit of all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP > users. To make sure they do not now CRASH. For example current > DAX code (which this is for) would crash. > If we would want to reuse page_mkwrite, We will need to first > patch all users, so to not-crash-on-no-page. Then enable this > patch. But even if I did that I would not sleep so well at night. > Adding a new vector is the safest thing to do, and is not that > expensive. an extra pointer at a static function vector per driver. > Also the new vector is better for performance, because else we > Will call all current Kernel vectors, so to: > check-ha-no-page-do-nothing and return. > > No need to call it from do_shared_fault because do_wp_page is called to > change pte permissions anyway. > > Signed-off-by: Yigal Korman <yigal@plexistor.com> > Signed-off-by: Boaz Harrosh <boaz@plexistor.com> > CC: Kirill A. 
Shutemov <kirill.shutemov@linux.intel.com> > CC: Matthew Wilcox <matthew.r.wilcox@intel.com> > CC: Jan Kara <jack@suse.cz> > CC: Andrew Morton <akpm@linux-foundation.org> > CC: Hugh Dickins <hughd@google.com> > CC: Mel Gorman <mgorman@suse.de> > CC: Konstantin Khlebnikov <koct9i@gmail.com> > CC: linux-mm@kvack.org > CC: Stable Tree <stable@vger.kernel.org> > --- > Documentation/filesystems/Locking | 8 ++++++++ > include/linux/mm.h | 3 +++ > mm/memory.c | 27 ++++++++++++++++++++++++++- > 3 files changed, 37 insertions(+), 1 deletion(-) > > diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking > index f91926f..25f36e6 100644 > --- a/Documentation/filesystems/Locking > +++ b/Documentation/filesystems/Locking > @@ -525,6 +525,7 @@ prototypes: > void (*close)(struct vm_area_struct*); > int (*fault)(struct vm_area_struct*, struct vm_fault *); > int (*page_mkwrite)(struct vm_area_struct *, struct vm_fault *); > + int (*pfn_mkwrite)(struct vm_area_struct *, struct vm_fault *); > int (*access)(struct vm_area_struct *, unsigned long, void*, int, int); > > locking rules: > @@ -534,6 +535,7 @@ close: yes > fault: yes can return with page locked > map_pages: yes > page_mkwrite: yes can return with page locked > +pfn_mkwrite: yes > access: yes > > ->fault() is called when a previously not present pte is about > @@ -560,6 +562,12 @@ the page has been truncated, the filesystem should not look up a new page > like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which > will cause the VM to retry the fault. > > + ->pfn_mkwrite() is the same as page_mkwrite but when the pte is > +VM_PFNMAP or VM_MIXEDMAP with a page-less entry. Expected return is > +VM_FAULT_NOPAGE. Or one of the VM_FAULT_ERROR types. The default behavior > +after this call is to make the pte read-write, unless pfn_mkwrite() > +already touched the pte, in that case it is untouched. 
> + > ->access() is called when get_user_pages() fails in > access_process_vm(), typically used to debug a process through > /proc/pid/mem or ptrace. This function is needed only for > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 47a9392..85ba9c2 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -251,6 +251,9 @@ struct vm_operations_struct { > * writable, if an error is returned it will cause a SIGBUS */ > int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf); > > + /* same as page_mkwrite when using VM_PFNMAP|VM_MIXEDMAP */ > + int (*pfn_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf); > + > /* called by access_process_vm when get_user_pages() fails, typically > * for use by special VMAs that can switch between memory and hardware > */ > diff --git a/mm/memory.c b/mm/memory.c > index 97839f5..6029777 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -1982,6 +1982,18 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page, > return ret; > } > > +static int do_pfn_mkwrite(struct vm_area_struct *vma, unsigned long address) > +{ > + struct vm_fault vmf = { > + .page = NULL, > + .pgoff = linear_page_index(vma, address), > + .virtual_address = (void __user *)(address & PAGE_MASK), > + .flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE, > + }; > + > + return vma->vm_ops->pfn_mkwrite(vma, &vmf); > +} > + > /* > * This routine handles present pages, when users try to write > * to a shared page. It is done by copying the page to a new address > @@ -2025,8 +2037,21 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, > * accounting on raw pfn maps. 
> */ > if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) == > - (VM_WRITE|VM_SHARED)) > + (VM_WRITE|VM_SHARED)) { > + if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) { > + pte_unmap_unlock(page_table, ptl); > + ret = do_pfn_mkwrite(vma, address); > + if (ret & VM_FAULT_ERROR) > + return ret; > + page_table = pte_offset_map_lock(mm, pmd, > + address, &ptl); > + if (!pte_same(*page_table, orig_pte)) { > + ret = 0; > + goto unlock; > + } > + } > goto reuse; > + } > goto gotten; > } > > ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/3 @stable] mm(v4.0): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP 2015-04-08 16:00 ` Boaz Harrosh (?) @ 2015-04-08 20:26 ` Greg KH 2015-04-12 7:49 ` Boaz Harrosh -1 siblings, 1 reply; 34+ messages in thread From: Greg KH @ 2015-04-08 20:26 UTC (permalink / raw) To: Boaz Harrosh Cc: Stable Tree, Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel, Eryu Guan, Christoph Hellwig On Wed, Apr 08, 2015 at 07:00:37PM +0300, Boaz Harrosh wrote: > On 04/08/2015 06:56 PM, Boaz Harrosh wrote: > > From: Yigal Korman <yigal@plexistor.com> > > > > [For Stable 4.0.X] > > The parallel patch at 4.1-rc1 to this patch is: > > Subject: mm: new pfn_mkwrite same as page_mkwrite for VM_PFNMAP > > > > We need this patch for the 4.0.X stable tree if the patch > > Subject: dax: use pfn_mkwrite to update c/mtime + freeze protection > > > > Was decided to be pulled into stable since it is a dependency > > of this patch. The file mm/memory.c was heavily changed in 4.1 > > hence this here. > > > > I forgot to send this patch for the stables tree, 4.0 only. > > Again this one is only needed if we are truing to pull > Subject: dax: use pfn_mkwrite to update c/mtime + freeze protection > > Which has the Stable@ tag. The problem it fixes is minor and might > be skipped if causes problems. I can't take patches in the stable tree that are not also in Linus's tree. Why can't I just take a corresponding patch that is already in Linus's tree, why do we need something "special" here? thanks, greg k-h ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/3 @stable] mm(v4.0): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP 2015-04-08 20:26 ` Greg KH @ 2015-04-12 7:49 ` Boaz Harrosh 0 siblings, 0 replies; 34+ messages in thread From: Boaz Harrosh @ 2015-04-12 7:49 UTC (permalink / raw) To: Greg KH Cc: Stable Tree, Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel, Eryu Guan, Christoph Hellwig On 04/08/2015 11:26 PM, Greg KH wrote: > On Wed, Apr 08, 2015 at 07:00:37PM +0300, Boaz Harrosh wrote: >> On 04/08/2015 06:56 PM, Boaz Harrosh wrote: >>> From: Yigal Korman <yigal@plexistor.com> >>> >>> [For Stable 4.0.X] >>> The parallel patch at 4.1-rc1 to this patch is: >>> Subject: mm: new pfn_mkwrite same as page_mkwrite for VM_PFNMAP >>> >>> We need this patch for the 4.0.X stable tree if the patch >>> Subject: dax: use pfn_mkwrite to update c/mtime + freeze protection >>> >>> Was decided to be pulled into stable since it is a dependency >>> of this patch. The file mm/memory.c was heavily changed in 4.1 >>> hence this here. >>> >> >> I forgot to send this patch for the stables tree, 4.0 only. >> >> Again this one is only needed if we are truing to pull >> Subject: dax: use pfn_mkwrite to update c/mtime + freeze protection >> >> Which has the Stable@ tag. The problem it fixes is minor and might >> be skipped if causes problems. > > I can't take patches in the stable tree that are not in Linus's tree > also. Why can't I just take a corrisponding patch that is already in > Linus's tree, why do we need something "special" here? > > thanks, > Hi Greg, yes, sorry, I did not explain very well. The akpm tree in linux-next has two patches: 8dca515 mm: new pfn_mkwrite same as page_mkwrite for VM_PFNMAP dac1bd2 dax: use pfn_mkwrite to update c/mtime + freeze protection Now these patches will hit Linus's tree in the 4.1 merge window. The second patch is tagged with a stable@ CC, because it fixes DAX, which was introduced in 4.0.
It depends on the 1st patch. However, the first patch is not tagged stable@ because it will not apply at all to 4.0; it patches mm/memory.c, which completely changes in 4.1. This is why I sent this special patch, which has the exact same title and does exactly the same as the 1st patch, but on the 4.0 Kernel. So when you encounter this 2nd patch with the Stable@ tag, I think the best thing is to just ignore it and wait for complaints, which will most probably not come because DAX is pretty experimental. (But if we do pull it, we will need this patch here.) > greg k-h > Thanks Boaz ^ permalink raw reply [flat|nested] 34+ messages in thread
* [PATCH 0/3 v3] dax: Fix mmap-write not updating c/mtime @ 2015-03-23 12:47 Boaz Harrosh 2015-03-23 12:52 ` [PATCH 2/3] dax: use pfn_mkwrite to update c/mtime + freeze protection Boaz Harrosh 0 siblings, 1 reply; 34+ messages in thread From: Boaz Harrosh @ 2015-03-23 12:47 UTC (permalink / raw) To: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel, Eryu Guan Hi [v3] * I'm re-posting the two DAX patches that fix the mmap-write-after-read problem with DAX. (No changes since [v2]) * I'm also posting a 3rd RFC patch to address what Jan said about fs_freeze and making the mapping read-only. Jan, please review and see if this is what you meant. [v2] Jan Kara has pointed out that if we add the sb_start/end_pagefault pair in the new pfn_mkwrite we are then fixing another bug where: a user could start writing to the page while the filesystem is frozen. [v1] The main problem is that current mm/memory.c will not call us with page_mkwrite if we do not have an actual page mapping, which is what DAX uses. The solution presented here introduces a new pfn_mkwrite to solve this problem. Please see patch-2 for details. I've been running with this patch for 4 months on both HW and VMs with no apparent danger, but see patch-1, where I played it safe. I am also posting an xfstest 080 that demonstrates this problem; I believe that some git operations (can't remember which) also suffer from this problem. Actually, Eryu Guan found that this test fails on some other FSs as well. List of patches: [PATCH 1/3] mm: New pfn_mkwrite same as page_mkwrite for VM_PFNMAP [PATCH 2/3] dax: use pfn_mkwrite to update c/mtime + freeze [PATCH 3/3] RFC: dax: dax_prepare_freeze [PATCH v4] xfstest: generic/080 test that mmap-write updates c/mtime Please, I need some mm person to review the first patch. Andrew, hi, I believe this needs to eventually go through your tree. Please pick it up when you feel it is ready. 
I believe the first 2 are ready and fix real bugs. Matthew, hi, I would love to have your ACK on these patches. Thanks Boaz ^ permalink raw reply [flat|nested] 34+ messages in thread
* [PATCH 2/3] dax: use pfn_mkwrite to update c/mtime + freeze protection
  2015-03-23 12:47 [PATCH 0/3 v3] dax: Fix mmap-write not updating c/mtime Boaz Harrosh
@ 2015-03-23 12:52 ` Boaz Harrosh
  0 siblings, 0 replies; 34+ messages in thread

From: Boaz Harrosh @ 2015-03-23 12:52 UTC
To: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
    Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
    linux-fsdevel, Eryu Guan

From: Yigal Korman <yigal@plexistor.com>

[v1]
Without this patch, c/mtime is not updated correctly when an mmap'ed
page is first read from and then written to. A new xfstest is submitted
for testing this (generic/080).

[v2]
Jan Kara pointed out that if we add the sb_start/end_pagefault pair in
the new pfn_mkwrite we also fix another bug: a user could start writing
to the page while the filesystem is frozen.

CC: Jan Kara <jack@suse.cz>
CC: Matthew Wilcox <matthew.r.wilcox@intel.com>
CC: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Yigal Korman <yigal@plexistor.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 fs/dax.c           | 17 +++++++++++++++++
 fs/ext2/file.c     |  1 +
 fs/ext4/file.c     |  1 +
 include/linux/fs.h |  1 +
 4 files changed, 20 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index ed1619e..d0bd1f4 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -464,6 +464,23 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 EXPORT_SYMBOL_GPL(dax_fault);
 
 /**
+ * dax_pfn_mkwrite - handle first write to DAX page
+ * @vma: The virtual memory area where the fault occurred
+ * @vmf: The description of the fault
+ *
+ */
+int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
+
+	sb_start_pagefault(sb);
+	file_update_time(vma->vm_file);
+	sb_end_pagefault(sb);
+	return VM_FAULT_NOPAGE;
+}
+EXPORT_SYMBOL_GPL(dax_pfn_mkwrite);
+
+/**
  * dax_zero_page_range - zero a range within a page of a DAX file
  * @inode: The file being truncated
  * @from: The file offset that is being truncated to
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index e317017..866a3ce 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -39,6 +39,7 @@ static int ext2_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 static const struct vm_operations_struct ext2_dax_vm_ops = {
 	.fault		= ext2_dax_fault,
 	.page_mkwrite	= ext2_dax_mkwrite,
+	.pfn_mkwrite	= dax_pfn_mkwrite,
 };
 
 static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 33a09da..b43a7a6 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -206,6 +206,7 @@ static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 static const struct vm_operations_struct ext4_dax_vm_ops = {
 	.fault		= ext4_dax_fault,
 	.page_mkwrite	= ext4_dax_mkwrite,
+	.pfn_mkwrite	= dax_pfn_mkwrite,
 };
 #else
 #define ext4_dax_vm_ops	ext4_file_vm_ops
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b4d71b5..24af817 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2597,6 +2597,7 @@ int dax_clear_blocks(struct inode *, sector_t block, long size);
 int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
+int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *);
 #define dax_mkwrite(vma, vmf, gb)	dax_fault(vma, vmf, gb)
 
 #ifdef CONFIG_BLOCK
-- 
1.9.3
end of thread, other threads:[~2015-04-12  7:49 UTC | newest]

Thread overview: 34+ messages -- links below jump to the message on this page --
2015-04-07  8:33 [PATCH 0/3 v5] dax: some dax fixes and cleanups Boaz Harrosh
2015-04-07  8:33 ` Boaz Harrosh
2015-04-07  8:40 ` [PATCH 1/3] mm(v4.1): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP Boaz Harrosh
2015-04-07  8:51   ` Boaz Harrosh
2015-04-07  8:51   ` Boaz Harrosh
2015-04-07  9:03   ` Kirill A. Shutemov
2015-04-07  9:22     ` Boaz Harrosh
2015-04-07 12:57   ` [PATCH 1/3 v6] " Boaz Harrosh
2015-04-07 12:57     ` Boaz Harrosh
2015-04-07 13:17     ` Kirill A. Shutemov
2015-04-07 13:26       ` Kirill A. Shutemov
2015-04-07 13:37       ` Boaz Harrosh
2015-04-07 13:37         ` Boaz Harrosh
2015-04-07 13:47         ` Kirill A. Shutemov
2015-04-07 14:09           ` Boaz Harrosh
2015-04-07 14:09             ` Boaz Harrosh
2015-04-07 14:06   ` [PATCH 1/3 v7] " Boaz Harrosh
2015-04-07 14:06     ` Boaz Harrosh
2015-04-07 14:12     ` Kirill A. Shutemov
2015-04-07  8:43 ` [PATCH 2/3] dax: use pfn_mkwrite to update c/mtime + freeze protection Boaz Harrosh
2015-04-07  8:43   ` Boaz Harrosh
2015-04-07 16:28   ` Jan Kara
2015-04-07 16:28     ` Jan Kara
2015-04-07  8:45 ` [PATCH 3/3] dax: Unify ext2/4_{dax,}_file_operations Boaz Harrosh
2015-04-07  8:45   ` Boaz Harrosh
2015-04-07 16:26   ` Jan Kara
2015-04-08 15:56 ` [PATCH 1/3 @stable] mm(v4.0): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP Boaz Harrosh
2015-04-08 15:56   ` Boaz Harrosh
2015-04-08 16:00   ` Boaz Harrosh
2015-04-08 16:00     ` Boaz Harrosh
2015-04-08 20:26   ` Greg KH
2015-04-12  7:49     ` Boaz Harrosh
2015-04-12  7:49       ` Boaz Harrosh

-- strict thread matches above, loose matches on Subject: below --
2015-03-23 12:47 [PATCH 0/3 v3] dax: Fix mmap-write not updating c/mtime Boaz Harrosh
2015-03-23 12:52 ` [PATCH 2/3] dax: use pfn_mkwrite to update c/mtime + freeze protection Boaz Harrosh