From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-we0-f177.google.com (mail-we0-f177.google.com [74.125.82.177]) by kanga.kvack.org (Postfix) with ESMTP id 153126B0038 for ; Mon, 23 Mar 2015 08:47:57 -0400 (EDT) Received: by wegp1 with SMTP id p1so136804509weg.1 for ; Mon, 23 Mar 2015 05:47:56 -0700 (PDT) Received: from mail-wg0-f52.google.com (mail-wg0-f52.google.com. [74.125.82.52]) by mx.google.com with ESMTPS id s5si11595201wik.109.2015.03.23.05.47.55 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 23 Mar 2015 05:47:55 -0700 (PDT) Received: by wgra20 with SMTP id a20so144997563wgr.3 for ; Mon, 23 Mar 2015 05:47:54 -0700 (PDT) Message-ID: <55100B78.501@plexistor.com> Date: Mon, 23 Mar 2015 14:47:52 +0200 From: Boaz Harrosh MIME-Version: 1.0 Subject: [PATCH 0/3 v3] dax: Fix mmap-write not updating c/mtime Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Dave Chinner , Matthew Wilcox , Andrew Morton , "Kirill A. Shutemov" , Jan Kara , Hugh Dickins , Mel Gorman , linux-mm@kvack.org, linux-nvdimm , linux-fsdevel , Eryu Guan Hi [v3] * I'm re-posting the two DAX patches that fix the mmap-write after read problem with DAX. (No changes since [v2]) * I'm also posting a 3rd RFC patch to address what Jan said about fs_freeze and making mapping read-only. Jan Please review and see if this is what you meant. [v2] Jan Kara has pointed out that if we add the sb_start/end_pagefault pair in the new pfn_mkwrite we are then fixing another bug where: A user could start writing to the page while filesystem is frozen. [v1] The main problem is that current mm/memory.c will no call us with page_mkwrite if we do not have an actual page mapping, which is what DAX uses. The solution presented here introduces a new pfn_mkwrite to solve this problem. Please see patch-2 for details. I've been running with this patch for 4 month both HW and VMs with no apparent danger, but see patch-1 I played it safe. I am also posting an xfstest 080 that demonstrate this problem, I believe that also some git operations (can't remember which) suffer from this problem. Actually Eryu Guan found that this test fails on some other FS as well. List of patches: [PATCH 1/3] mm: New pfn_mkwrite same as page_mkwrite for VM_PFNMAP [PATCH 2/3] dax: use pfn_mkwrite to update c/mtime + freeze [PATCH 3/3] RFC: dax: dax_prepare_freeze [PATCH v4] xfstest: generic/080 test that mmap-write updates c/mtime Please I need that some mm person review the first patch? Andrew hi I believe this needs to eventually go through your tree. Please pick it up when you feel it is ready. I believe the first 2 are ready and fix real bugs. Matthew hi I would love to have your ACK on these patches? Thanks Boaz -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-we0-f177.google.com (mail-we0-f177.google.com [74.125.82.177]) by kanga.kvack.org (Postfix) with ESMTP id 60BEE6B006C for ; Mon, 23 Mar 2015 08:49:37 -0400 (EDT) Received: by weop45 with SMTP id p45so136900494weo.0 for ; Mon, 23 Mar 2015 05:49:37 -0700 (PDT) Received: from mail-we0-f174.google.com (mail-we0-f174.google.com. [74.125.82.174]) by mx.google.com with ESMTPS id fx10si2389964wib.118.2015.03.23.05.49.35 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 23 Mar 2015 05:49:36 -0700 (PDT) Received: by wegp1 with SMTP id p1so136841830weg.1 for ; Mon, 23 Mar 2015 05:49:35 -0700 (PDT) Message-ID: <55100BDC.7000901@plexistor.com> Date: Mon, 23 Mar 2015 14:49:32 +0200 From: Boaz Harrosh MIME-Version: 1.0 Subject: [PATCH 1/3] mm: New pfn_mkwrite same as page_mkwrite for VM_PFNMAP References: <55100B78.501@plexistor.com> In-Reply-To: <55100B78.501@plexistor.com> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Dave Chinner , Matthew Wilcox , Andrew Morton , "Kirill A. Shutemov" , Jan Kara , Hugh Dickins , Mel Gorman , linux-mm@kvack.org, linux-nvdimm , linux-fsdevel , Eryu Guan From: Yigal Korman This will allow FS that uses VM_PFNMAP | VM_MIXEDMAP (no page structs) to get notified when access is a write to a read-only PFN. This can happen if we mmap() a file then first mmap-read from it to page-in a read-only PFN, than we mmap-write to the same page. We need this functionality to fix a DAX bug, where in the scenario above we fail to set ctime/mtime though we modified the file. An xfstest is attached to this patchset that shows the failure and the fix. (A DAX patch will follow) This functionality is extra important for us, because upon dirtying of a pmem page we also want to RDMA the page to a remote cluster node. We define a new pfn_mkwrite and do not reuse page_mkwrite because 1 - The name ;-) 2 - But mainly because it would take a very long and tedious audit of all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP users. To make sure they do not now CRASH. For example current DAX code (which this is for) would crash. If we would want to reuse page_mkwrite, We will need to first patch all users, so to not-crash-on-no-page. Then enable this patch. But even if I did that I would not sleep so well at night. Adding a new vector is the safest thing to do, and is not that expensive. an extra pointer at a static function vector per driver. Also the new vector is better for performance, because else we Will call all current Kernel vectors, so to: check-ha-no-page-do-nothing and return. No need to call it from do_shared_fault because do_wp_page is called to change pte permissions anyway. CC: Matthew Wilcox CC: Kirill A. Shutemov CC: Jan Kara CC: Andrew Morton CC: Hugh Dickins CC: Mel Gorman CC: linux-mm@kvack.org Signed-off-by: Yigal Korman Signed-off-by: Boaz Harrosh --- include/linux/mm.h | 2 ++ mm/memory.c | 27 ++++++++++++++++++++++++++- 2 files changed, 28 insertions(+), 1 deletion(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 47a9392..1cd820c 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -250,6 +250,8 @@ struct vm_operations_struct { /* notification that a previously read-only page is about to become * writable, if an error is returned it will cause a SIGBUS */ int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf); + /* same as page_mkwrite when using VM_PFNMAP|VM_MIXEDMAP */ + int (*pfn_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf); /* called by access_process_vm when get_user_pages() fails, typically * for use by special VMAs that can switch between memory and hardware diff --git a/mm/memory.c b/mm/memory.c index 8068893..ebef8f6 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1982,6 +1982,22 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page, return ret; } +static int do_pfn_mkwrite(struct vm_area_struct *vma, unsigned long address) +{ + struct vm_fault vmf; + + if (!vma->vm_ops || !vma->vm_ops->pfn_mkwrite) + return 0; + + vmf.page = 0; + vmf.pgoff = (((address & PAGE_MASK) - vma->vm_start) >> PAGE_SHIFT) + + vma->vm_pgoff; + vmf.virtual_address = (void __user *)(address & PAGE_MASK); + vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE; + + return vma->vm_ops->pfn_mkwrite(vma, &vmf); +} + /* * This routine handles present pages, when users try to write * to a shared page. It is done by copying the page to a new address @@ -2025,8 +2041,17 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, * accounting on raw pfn maps. */ if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) == - (VM_WRITE|VM_SHARED)) + (VM_WRITE|VM_SHARED)) { + pte_unmap_unlock(page_table, ptl); + ret = do_pfn_mkwrite(vma, address); + if (ret & VM_FAULT_ERROR) + return ret; + page_table = pte_offset_map_lock(mm, pmd, address, + &ptl); + if (!pte_same(*page_table, orig_pte)) + goto unlock; goto reuse; + } goto gotten; } -- 1.9.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wg0-f52.google.com (mail-wg0-f52.google.com [74.125.82.52]) by kanga.kvack.org (Postfix) with ESMTP id 4C5FB6B006E for ; Mon, 23 Mar 2015 08:52:40 -0400 (EDT) Received: by wgs2 with SMTP id 2so38189539wgs.1 for ; Mon, 23 Mar 2015 05:52:39 -0700 (PDT) Received: from mail-wg0-f54.google.com (mail-wg0-f54.google.com. [74.125.82.54]) by mx.google.com with ESMTPS id um2si928581wjc.201.2015.03.23.05.52.38 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 23 Mar 2015 05:52:38 -0700 (PDT) Received: by wgra20 with SMTP id a20so145101215wgr.3 for ; Mon, 23 Mar 2015 05:52:38 -0700 (PDT) Message-ID: <55100C93.9040607@plexistor.com> Date: Mon, 23 Mar 2015 14:52:35 +0200 From: Boaz Harrosh MIME-Version: 1.0 Subject: [PATCH 2/3] dax: use pfn_mkwrite to update c/mtime + freeze protection References: <55100B78.501@plexistor.com> In-Reply-To: <55100B78.501@plexistor.com> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Dave Chinner , Matthew Wilcox , Andrew Morton , "Kirill A. Shutemov" , Jan Kara , Hugh Dickins , Mel Gorman , linux-mm@kvack.org, linux-nvdimm , linux-fsdevel , Eryu Guan From: Yigal Korman [v1] Without this patch, c/mtime is not updated correctly when mmap'ed page is first read from and then written to. A new xfstest is submitted for testing this (generic/080) [v2] Jan Kara has pointed out that if we add the sb_start/end_pagefault pair in the new pfn_mkwrite we are then fixing another bug where: A user could start writing to the page while filesystem is frozen. CC: Jan Kara CC: Matthew Wilcox CC: Andrew Morton Signed-off-by: Yigal Korman Signed-off-by: Boaz Harrosh --- fs/dax.c | 17 +++++++++++++++++ fs/ext2/file.c | 1 + fs/ext4/file.c | 1 + include/linux/fs.h | 1 + 4 files changed, 20 insertions(+) diff --git a/fs/dax.c b/fs/dax.c index ed1619e..d0bd1f4 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -464,6 +464,23 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, EXPORT_SYMBOL_GPL(dax_fault); /** + * dax_pfn_mkwrite - handle first write to DAX page + * @vma: The virtual memory area where the fault occurred + * @vmf: The description of the fault + * + */ +int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) +{ + struct super_block *sb = file_inode(vma->vm_file)->i_sb; + + sb_start_pagefault(sb); + file_update_time(vma->vm_file); + sb_end_pagefault(sb); + return VM_FAULT_NOPAGE; +} +EXPORT_SYMBOL_GPL(dax_pfn_mkwrite); + +/** * dax_zero_page_range - zero a range within a page of a DAX file * @inode: The file being truncated * @from: The file offset that is being truncated to diff --git a/fs/ext2/file.c b/fs/ext2/file.c index e317017..866a3ce 100644 --- a/fs/ext2/file.c +++ b/fs/ext2/file.c @@ -39,6 +39,7 @@ static int ext2_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) static const struct vm_operations_struct ext2_dax_vm_ops = { .fault = ext2_dax_fault, .page_mkwrite = ext2_dax_mkwrite, + .pfn_mkwrite = dax_pfn_mkwrite, }; static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma) diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 33a09da..b43a7a6 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -206,6 +206,7 @@ static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) static const struct vm_operations_struct ext4_dax_vm_ops = { .fault = ext4_dax_fault, .page_mkwrite = ext4_dax_mkwrite, + .pfn_mkwrite = dax_pfn_mkwrite, }; #else #define ext4_dax_vm_ops ext4_file_vm_ops diff --git a/include/linux/fs.h b/include/linux/fs.h index b4d71b5..24af817 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2597,6 +2597,7 @@ int dax_clear_blocks(struct inode *, sector_t block, long size); int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t); int dax_truncate_page(struct inode *, loff_t from, get_block_t); int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t); +int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *); #define dax_mkwrite(vma, vmf, gb) dax_fault(vma, vmf, gb) #ifdef CONFIG_BLOCK -- 1.9.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f180.google.com (mail-wi0-f180.google.com [209.85.212.180]) by kanga.kvack.org (Postfix) with ESMTP id C12AD6B0071 for ; Mon, 23 Mar 2015 08:54:44 -0400 (EDT) Received: by wixw10 with SMTP id w10so61604519wix.0 for ; Mon, 23 Mar 2015 05:54:44 -0700 (PDT) Received: from mail-we0-f182.google.com (mail-we0-f182.google.com. [74.125.82.182]) by mx.google.com with ESMTPS id ln3si11639410wic.72.2015.03.23.05.54.42 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 23 Mar 2015 05:54:43 -0700 (PDT) Received: by wetk59 with SMTP id k59so137038003wet.3 for ; Mon, 23 Mar 2015 05:54:42 -0700 (PDT) Message-ID: <55100D10.6090902@plexistor.com> Date: Mon, 23 Mar 2015 14:54:40 +0200 From: Boaz Harrosh MIME-Version: 1.0 Subject: [PATCH 3/3] RFC: dax: dax_prepare_freeze References: <55100B78.501@plexistor.com> In-Reply-To: <55100B78.501@plexistor.com> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Dave Chinner , Matthew Wilcox , Andrew Morton , "Kirill A. Shutemov" , Jan Kara , Hugh Dickins , Mel Gorman , linux-mm@kvack.org, linux-nvdimm , linux-fsdevel , Eryu Guan From: Boaz Harrosh When freezing an FS, we must write protect all IS_DAX() inodes that have an mmap mapping on an inode. Otherwise application will be able to modify previously faulted-in file pages. I'm actually doing a full unmap_mapping_range because there is no readily available "mapping_write_protect" like functionality. I do not think it is worth it to define one just for here and just for some extra read-faults after an fs_freeze. How hot-path is fs_freeze at all? CC: Jan Kara CC: Matthew Wilcox CC: Andrew Morton Signed-off-by: Boaz Harrosh --- fs/dax.c | 30 ++++++++++++++++++++++++++++++ fs/super.c | 3 +++ include/linux/fs.h | 1 + 3 files changed, 34 insertions(+) diff --git a/fs/dax.c b/fs/dax.c index d0bd1f4..f3fc28b 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -549,3 +549,33 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block) return dax_zero_page_range(inode, from, length, get_block); } EXPORT_SYMBOL_GPL(dax_truncate_page); + +/* This is meant to be called as part of freeze_super. otherwise we might + * Need some extra locking before calling here. + */ +void dax_prepare_freeze(struct super_block *sb) +{ + struct inode *inode; + + /* TODO: each DAX fs has some private mount option to enable DAX. If + * We made that option a generic MS_DAX_ENABLE super_block flag we could + * Avoid the 95% extra unneeded loop-on-all-inodes every freeze. + * if (!(sb->s_flags & MS_DAX_ENABLE)) + * return 0; + */ + + list_for_each_entry(inode, &sb->s_inodes, i_sb_list) { + /* TODO: For freezing we can actually do with write-protecting + * the page. But I cannot find a ready made function that does + * that for a giving mapping (with all the proper locking). + * How performance sensitive is the all sb_freeze API? + * For now we can just unmap the all mapping, and pay extra + * on read faults. + */ + /* NOTE: Do not unmap private COW mapped pages it will not + * modify the FS. + */ + if (IS_DAX(inode)) + unmap_mapping_range(inode->i_mapping, 0, 0, 0); + } +} diff --git a/fs/super.c b/fs/super.c index 2b7dc90..9ef490c 100644 --- a/fs/super.c +++ b/fs/super.c @@ -1329,6 +1329,9 @@ int freeze_super(struct super_block *sb) /* All writers are done so after syncing there won't be dirty data */ sync_filesystem(sb); + /* Need to take care of DAX mmaped inodes */ + dax_prepare_freeze(sb); + /* Now wait for internal filesystem counter */ sb->s_writers.frozen = SB_FREEZE_FS; smp_wmb(); diff --git a/include/linux/fs.h b/include/linux/fs.h index 24af817..3b943d4 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2599,6 +2599,7 @@ int dax_truncate_page(struct inode *, loff_t from, get_block_t); int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t); int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *); #define dax_mkwrite(vma, vmf, gb) dax_fault(vma, vmf, gb) +void dax_prepare_freeze(struct super_block *sb); #ifdef CONFIG_BLOCK typedef void (dio_submit_t)(int rw, struct bio *bio, struct inode *inode, -- 1.9.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f177.google.com (mail-wi0-f177.google.com [209.85.212.177]) by kanga.kvack.org (Postfix) with ESMTP id AEC686B0071 for ; Mon, 23 Mar 2015 08:56:48 -0400 (EDT) Received: by wixw10 with SMTP id w10so34342102wix.0 for ; Mon, 23 Mar 2015 05:56:48 -0700 (PDT) Received: from mail-wg0-f41.google.com (mail-wg0-f41.google.com. [74.125.82.41]) by mx.google.com with ESMTPS id k8si11645830wiy.84.2015.03.23.05.56.46 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 23 Mar 2015 05:56:47 -0700 (PDT) Received: by wgra20 with SMTP id a20so145196530wgr.3 for ; Mon, 23 Mar 2015 05:56:46 -0700 (PDT) Message-ID: <55100D8B.90409@plexistor.com> Date: Mon, 23 Mar 2015 14:56:43 +0200 From: Boaz Harrosh MIME-Version: 1.0 Subject: [PATCH v4] xfstest: generic/080 test that mmap-write updates c/mtime References: <55100B78.501@plexistor.com> In-Reply-To: <55100B78.501@plexistor.com> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Dave Chinner , Matthew Wilcox , Andrew Morton , "Kirill A. Shutemov" , Jan Kara , Hugh Dickins , Mel Gorman , linux-mm@kvack.org, linux-nvdimm , linux-fsdevel , Eryu Guan From: Dave Chinner when using mmap() for file i/o, writing to the file should update it's c/mtime. Specifically if we first mmap-read from a page, then memap-write to the same page. This test was failing for the initial submission of DAX because pfn based mapping do not have an page_mkwrite called for them. The new Kernel patches that introduce pfn_mkwrite fixes this test. Written by Dave Chinner but edited and tested by: Omer Zilberberg Dave hands-up man, it looks like you edited this directly in the email, but there was not even a single typo. Tested-by: Omer Zilberberg Tested-by: Boaz Harrosh Signed-off-by: Omer Zilberberg Signed-off-by: Boaz Harrosh Reviewed-by: Eryu Guan --- tests/generic/080 | 78 +++++++++++++++++++++++++++++++++++++++++++++++++++ tests/generic/080.out | 2 ++ tests/generic/group | 1 + 3 files changed, 81 insertions(+) create mode 100755 tests/generic/080 create mode 100644 tests/generic/080.out diff --git a/tests/generic/080 b/tests/generic/080 new file mode 100755 index 0000000..43c93d7 --- /dev/null +++ b/tests/generic/080 @@ -0,0 +1,78 @@ +#! /bin/bash +# FS QA Test No. 080 +# +# Verify that mtime is updated when writing to mmap-ed pages +# +#----------------------------------------------------------------------- +# This program is free software; you can redistribute it and/or +# modify it under the terms of the GNU General Public License as +# published by the Free Software Foundation. +# +# This program is distributed in the hope that it would be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write the Free Software Foundation, +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA +#----------------------------------------------------------------------- +# + +seq=`basename $0` +seqres=$RESULT_DIR/$seq +echo "QA output created by $seq" + +here=`pwd` +tmp=/tmp/$$ +status=0 +trap "_cleanup; exit \$status" 0 1 2 3 15 + +_cleanup() +{ + cd / + rm -f $tmp.* + rm -f $testfile +} + +# get standard environment, filters and checks +. ./common/rc +. ./common/filter + +# real QA test starts here +_supported_fs generic +_supported_os IRIX Linux +_require_test + +echo "Silence is golden." +rm -f $seqres.full + +# pattern the file. +testfile=$TEST_DIR/mmap_mtime_testfile +$XFS_IO_PROG -f -c "pwrite 0 4k" -c fsync $testfile >> $seqres.full + +# sample timestamps. +mtime1=`stat -c %Y $testfile` +ctime1=`stat -c %Z $testfile` +echo "before mwrite: $mtime1 $ctime1" >> $seqres.full + +# map read followed by map write to trigger timestamp change +sleep 2 +$XFS_IO_PROG -c "mmap 0 4k" -c "mread 0 4k" -c "mwrite 0 4k" $testfile \ + >> $seqres.full + +# sample and verify that timestamps have changed. +mtime2=`stat -c %Y $testfile` +ctime2=`stat -c %Z $testfile` +echo "after mwrite : $mtime2 $ctime2" >> $seqres.full + +if [ "$mtime1" == "$mtime2" ]; then + echo "mtime not updated" + let status=$status+1 +fi +if [ "$ctime1" == "$ctime2" ]; then + echo "ctime not updated" + let status=$status+1 +fi + +exit diff --git a/tests/generic/080.out b/tests/generic/080.out new file mode 100644 index 0000000..cccac52 --- /dev/null +++ b/tests/generic/080.out @@ -0,0 +1,2 @@ +QA output created by 080 +Silence is golden. diff --git a/tests/generic/group b/tests/generic/group index d56d3ce..8154401 100644 --- a/tests/generic/group +++ b/tests/generic/group @@ -79,6 +79,7 @@ 077 acl attr auto enospc 078 auto quick metadata 079 acl attr ioctl metadata auto quick +080 auto quick 083 rw auto enospc stress 088 perms auto quick 089 metadata auto -- 1.9.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f53.google.com (mail-pa0-f53.google.com [209.85.220.53]) by kanga.kvack.org (Postfix) with ESMTP id B80E46B006C for ; Mon, 23 Mar 2015 18:40:52 -0400 (EDT) Received: by pabxg6 with SMTP id xg6so192417235pab.0 for ; Mon, 23 Mar 2015 15:40:52 -0700 (PDT) Received: from ipmail06.adl6.internode.on.net (ipmail06.adl6.internode.on.net. [150.101.137.145]) by mx.google.com with ESMTP id o7si2816048pdp.136.2015.03.23.15.40.50 for ; Mon, 23 Mar 2015 15:40:52 -0700 (PDT) Date: Tue, 24 Mar 2015 09:40:47 +1100 From: Dave Chinner Subject: Re: [PATCH 3/3] RFC: dax: dax_prepare_freeze Message-ID: <20150323224047.GQ28621@dastard> References: <55100B78.501@plexistor.com> <55100D10.6090902@plexistor.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <55100D10.6090902@plexistor.com> Sender: owner-linux-mm@kvack.org List-ID: To: Boaz Harrosh Cc: Matthew Wilcox , Andrew Morton , "Kirill A. Shutemov" , Jan Kara , Hugh Dickins , Mel Gorman , linux-mm@kvack.org, linux-nvdimm , linux-fsdevel , Eryu Guan On Mon, Mar 23, 2015 at 02:54:40PM +0200, Boaz Harrosh wrote: > From: Boaz Harrosh > > When freezing an FS, we must write protect all IS_DAX() > inodes that have an mmap mapping on an inode. Otherwise > application will be able to modify previously faulted-in > file pages. All you need to do is lock out page faults once the page is clean; that's what the sb_start_pagefault() calls are for in the page fault path - they catch write faults and block them until the filesystem is unfrozen. Hence I'm not sure why this would be necessary if you are catching write faults in .pfn_mkwrite.... Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pd0-f175.google.com (mail-pd0-f175.google.com [209.85.192.175]) by kanga.kvack.org (Postfix) with ESMTP id BE4BB6B0038 for ; Mon, 23 Mar 2015 18:49:05 -0400 (EDT) Received: by pdbni2 with SMTP id ni2so201292548pdb.1 for ; Mon, 23 Mar 2015 15:49:05 -0700 (PDT) Received: from mail.linuxfoundation.org (mail.linuxfoundation.org. [140.211.169.12]) by mx.google.com with ESMTPS id in5si2670542pbd.231.2015.03.23.15.49.04 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 23 Mar 2015 15:49:04 -0700 (PDT) Date: Mon, 23 Mar 2015 15:49:03 -0700 From: Andrew Morton Subject: Re: [PATCH 1/3] mm: New pfn_mkwrite same as page_mkwrite for VM_PFNMAP Message-Id: <20150323154903.5f5263095a4f7eff59bc9bb8@linux-foundation.org> In-Reply-To: <55100BDC.7000901@plexistor.com> References: <55100B78.501@plexistor.com> <55100BDC.7000901@plexistor.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Boaz Harrosh Cc: Dave Chinner , Matthew Wilcox , "Kirill A. Shutemov" , Jan Kara , Hugh Dickins , Mel Gorman , linux-mm@kvack.org, linux-nvdimm , linux-fsdevel , Eryu Guan On Mon, 23 Mar 2015 14:49:32 +0200 Boaz Harrosh wrote: > From: Yigal Korman > > This will allow FS that uses VM_PFNMAP | VM_MIXEDMAP (no page structs) > to get notified when access is a write to a read-only PFN. > > This can happen if we mmap() a file then first mmap-read from it > to page-in a read-only PFN, than we mmap-write to the same page. > > We need this functionality to fix a DAX bug, where in the scenario > above we fail to set ctime/mtime though we modified the file. > An xfstest is attached to this patchset that shows the failure > and the fix. (A DAX patch will follow) > > This functionality is extra important for us, because upon > dirtying of a pmem page we also want to RDMA the page to a > remote cluster node. > > We define a new pfn_mkwrite and do not reuse page_mkwrite because > 1 - The name ;-) > 2 - But mainly because it would take a very long and tedious > audit of all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP > users. To make sure they do not now CRASH. For example current > DAX code (which this is for) would crash. > If we would want to reuse page_mkwrite, We will need to first > patch all users, so to not-crash-on-no-page. Then enable this > patch. But even if I did that I would not sleep so well at night. > Adding a new vector is the safest thing to do, and is not that > expensive. an extra pointer at a static function vector per driver. > Also the new vector is better for performance, because else we > Will call all current Kernel vectors, so to: > check-ha-no-page-do-nothing and return. > > No need to call it from do_shared_fault because do_wp_page is called to > change pte permissions anyway. Looks OK to me. > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -1982,6 +1982,22 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page, > return ret; > } > > +static int do_pfn_mkwrite(struct vm_area_struct *vma, unsigned long address) > +{ > + struct vm_fault vmf; > + > + if (!vma->vm_ops || !vma->vm_ops->pfn_mkwrite) > + return 0; > + > + vmf.page = 0; > + vmf.pgoff = (((address & PAGE_MASK) - vma->vm_start) >> PAGE_SHIFT) + > + vma->vm_pgoff; > + vmf.virtual_address = (void __user *)(address & PAGE_MASK); > + vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE; > + > + return vma->vm_ops->pfn_mkwrite(vma, &vmf); > +} It might be a little neater to use if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) { struct vm_fault vmf = { ... }; ... } > /* > * This routine handles present pages, when users try to write > * to a shared page. It is done by copying the page to a new address > @@ -2025,8 +2041,17 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, > * accounting on raw pfn maps. > */ > if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) == > - (VM_WRITE|VM_SHARED)) > + (VM_WRITE|VM_SHARED)) { > + pte_unmap_unlock(page_table, ptl); > + ret = do_pfn_mkwrite(vma, address); > + if (ret & VM_FAULT_ERROR) > + return ret; > + page_table = pte_offset_map_lock(mm, pmd, address, > + &ptl); > + if (!pte_same(*page_table, orig_pte)) > + goto unlock; > goto reuse; > + } > goto gotten; There are significant pending changes in this area. See linux-next, or http://ozlabs.org/~akpm/mmots/broken-out/mm-refactor-* -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-we0-f175.google.com (mail-we0-f175.google.com [74.125.82.175]) by kanga.kvack.org (Postfix) with ESMTP id 989976B0038 for ; Tue, 24 Mar 2015 02:15:03 -0400 (EDT) Received: by wegp1 with SMTP id p1so155027552weg.1 for ; Mon, 23 Mar 2015 23:15:03 -0700 (PDT) Received: from mail-we0-f177.google.com (mail-we0-f177.google.com. [74.125.82.177]) by mx.google.com with ESMTPS id w1si15461363wix.3.2015.03.23.23.15.01 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 23 Mar 2015 23:15:01 -0700 (PDT) Received: by weop45 with SMTP id p45so155097677weo.0 for ; Mon, 23 Mar 2015 23:15:01 -0700 (PDT) Message-ID: <551100E3.9010007@plexistor.com> Date: Tue, 24 Mar 2015 08:14:59 +0200 From: Boaz Harrosh MIME-Version: 1.0 Subject: Re: [PATCH 3/3] RFC: dax: dax_prepare_freeze References: <55100B78.501@plexistor.com> <55100D10.6090902@plexistor.com> <20150323224047.GQ28621@dastard> In-Reply-To: <20150323224047.GQ28621@dastard> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Dave Chinner Cc: Matthew Wilcox , Andrew Morton , "Kirill A. Shutemov" , Jan Kara , Hugh Dickins , Mel Gorman , linux-mm@kvack.org, linux-nvdimm , linux-fsdevel , Eryu Guan On 03/24/2015 12:40 AM, Dave Chinner wrote: > On Mon, Mar 23, 2015 at 02:54:40PM +0200, Boaz Harrosh wrote: >> From: Boaz Harrosh >> >> When freezing an FS, we must write protect all IS_DAX() >> inodes that have an mmap mapping on an inode. Otherwise >> application will be able to modify previously faulted-in >> file pages. > > All you need to do is lock out page faults once the page is clean; > that's what the sb_start_pagefault() calls are for in the page fault > path - they catch write faults and block them until the filesystem > is unfrozen. Hence I'm not sure why this would be necessary if you > are catching write faults in .pfn_mkwrite.... > Jan pointed it out and he was right I have a test for this. What happens is that since we had a mapping from before the freeze we will not have a page-fault. And the buffers will be modified. As Jan explained in the cache path we do a writeback which turns all pages to read-only. But with dax we do not have writeback so the pages stay read-write mapped. Something needs to loop through the pages and write-protect them. I chose to unmap them because it is the much-much smaller code, and I do not care to optimize the freeze. [Yes, sigh, I will convert the test to an xfstest. May I just add it to some existing fs_freeze test. Only novelty is that we need to write-access an mmap block before the freeze-start, then continue access after the freeze and see modifications ] > Cheers, > Dave. > Thanks Boaz -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-we0-f182.google.com (mail-we0-f182.google.com [74.125.82.182]) by kanga.kvack.org (Postfix) with ESMTP id 1C3696B0038 for ; Tue, 24 Mar 2015 08:37:50 -0400 (EDT) Received: by weoy45 with SMTP id y45so12496418weo.2 for ; Tue, 24 Mar 2015 05:37:49 -0700 (PDT) Received: from mail-wg0-f54.google.com (mail-wg0-f54.google.com. [74.125.82.54]) by mx.google.com with ESMTPS id y1si16850338wiw.27.2015.03.24.05.37.47 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 24 Mar 2015 05:37:48 -0700 (PDT) Received: by wgdm6 with SMTP id m6so170099819wgd.2 for ; Tue, 24 Mar 2015 05:37:47 -0700 (PDT) Message-ID: <55115A99.40705@plexistor.com> Date: Tue, 24 Mar 2015 14:37:45 +0200 From: Boaz Harrosh MIME-Version: 1.0 Subject: Re: [PATCH 3/3] RFC: dax: dax_prepare_freeze References: <55100B78.501@plexistor.com> <55100D10.6090902@plexistor.com> In-Reply-To: <55100D10.6090902@plexistor.com> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Boaz Harrosh , Dave Chinner , Matthew Wilcox , Andrew Morton , "Kirill A. Shutemov" , Jan Kara , Hugh Dickins , Mel Gorman , linux-mm@kvack.org, linux-nvdimm , linux-fsdevel , Eryu Guan On 03/23/2015 02:54 PM, Boaz Harrosh wrote: > From: Boaz Harrosh > > When freezing an FS, we must write protect all IS_DAX() > inodes that have an mmap mapping on an inode. Otherwise > application will be able to modify previously faulted-in > file pages. > > I'm actually doing a full unmap_mapping_range because > there is no readily available "mapping_write_protect" like > functionality. I do not think it is worth it to define one > just for here and just for some extra read-faults after an > fs_freeze. > > How hot-path is fs_freeze at all? > OK So reinspecting this was a complete raw RFC. I need to do more work on this thing comments below ... > CC: Jan Kara > CC: Matthew Wilcox > CC: Andrew Morton > Signed-off-by: Boaz Harrosh > --- > fs/dax.c | 30 ++++++++++++++++++++++++++++++ > fs/super.c | 3 +++ > include/linux/fs.h | 1 + > 3 files changed, 34 insertions(+) > > diff --git a/fs/dax.c b/fs/dax.c > index d0bd1f4..f3fc28b 100644 > --- a/fs/dax.c > +++ b/fs/dax.c > @@ -549,3 +549,33 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block) > return dax_zero_page_range(inode, from, length, get_block); > } > EXPORT_SYMBOL_GPL(dax_truncate_page); > + > +/* This is meant to be called as part of freeze_super. otherwise we might > + * Need some extra locking before calling here. > + */ > +void dax_prepare_freeze(struct super_block *sb) > +{ > + struct inode *inode; > + > + /* TODO: each DAX fs has some private mount option to enable DAX. If > + * We made that option a generic MS_DAX_ENABLE super_block flag we could > + * Avoid the 95% extra unneeded loop-on-all-inodes every freeze. > + * if (!(sb->s_flags & MS_DAX_ENABLE)) > + * return 0; > + */ > + > + list_for_each_entry(inode, &sb->s_inodes, i_sb_list) { > + /* TODO: For freezing we can actually do with write-protecting > + * the page. But I cannot find a ready made function that does > + * that for a giving mapping (with all the proper locking). > + * How performance sensitive is the all sb_freeze API? > + * For now we can just unmap the all mapping, and pay extra > + * on read faults. > + */ > + /* NOTE: Do not unmap private COW mapped pages it will not > + * modify the FS. > + */ > + if (IS_DAX(inode)) > + unmap_mapping_range(inode->i_mapping, 0, 0, 0); So what happens here is that we loop on all sb->s_inodes every freeze and in the not DAX case just do nothing. It could be nice to have a flag at the sb level to tel us if we need to expect IS_DAX() inodes at all, for example when we are mounted on an harddisk it should not be set. All of ext2/4 and now Dave's xfs have their own XFS_MOUNT_DAX / EXT2_MOUNT_DAX / EXT4_MOUNT_DAX Is it OK if I unify all this on sb->s_flags |= MS_MOUNT_DAX so I can check it here in Generic code? The option parsing will be done by each FS but the flag be global? > + } > +} > diff --git a/fs/super.c b/fs/super.c > index 2b7dc90..9ef490c 100644 > --- a/fs/super.c > +++ b/fs/super.c > @@ -1329,6 +1329,9 @@ int freeze_super(struct super_block *sb) > /* All writers are done so after syncing there won't be dirty data */ > sync_filesystem(sb); > > + /* Need to take care of DAX mmaped inodes */ > + dax_prepare_freeze(sb); > + So if CONFIG_FS_DAX is not set this will not compile I need to define an empty one if not set Cheers Boaz > /* Now wait for internal filesystem counter */ > sb->s_writers.frozen = SB_FREEZE_FS; > smp_wmb(); > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 24af817..3b943d4 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -2599,6 +2599,7 @@ int dax_truncate_page(struct inode *, loff_t from, get_block_t); > int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t); > int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *); > #define dax_mkwrite(vma, vmf, gb) dax_fault(vma, vmf, gb) > +void dax_prepare_freeze(struct super_block *sb); > > #ifdef CONFIG_BLOCK > typedef void (dio_submit_t)(int rw, struct bio *bio, struct inode *inode, > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pd0-f170.google.com (mail-pd0-f170.google.com [209.85.192.170]) by kanga.kvack.org (Postfix) with ESMTP id 081766B0038 for ; Tue, 24 Mar 2015 22:22:28 -0400 (EDT) Received: by pdbop1 with SMTP id op1so12479585pdb.2 for ; Tue, 24 Mar 2015 19:22:27 -0700 (PDT) Received: from ipmail06.adl2.internode.on.net (ipmail06.adl2.internode.on.net. [150.101.137.129]) by mx.google.com with ESMTP id oa11si1450279pdb.33.2015.03.24.19.22.25 for ; Tue, 24 Mar 2015 19:22:27 -0700 (PDT) Date: Wed, 25 Mar 2015 13:22:21 +1100 From: Dave Chinner Subject: Re: [PATCH 3/3] RFC: dax: dax_prepare_freeze Message-ID: <20150325022221.GA31342@dastard> References: <55100B78.501@plexistor.com> <55100D10.6090902@plexistor.com> <20150323224047.GQ28621@dastard> <551100E3.9010007@plexistor.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <551100E3.9010007@plexistor.com> Sender: owner-linux-mm@kvack.org List-ID: To: Boaz Harrosh Cc: Matthew Wilcox , Andrew Morton , "Kirill A. Shutemov" , Jan Kara , Hugh Dickins , Mel Gorman , linux-mm@kvack.org, linux-nvdimm , linux-fsdevel , Eryu Guan On Tue, Mar 24, 2015 at 08:14:59AM +0200, Boaz Harrosh wrote: > On 03/24/2015 12:40 AM, Dave Chinner wrote: > > On Mon, Mar 23, 2015 at 02:54:40PM +0200, Boaz Harrosh wrote: > >> From: Boaz Harrosh > >> > >> When freezing an FS, we must write protect all IS_DAX() > >> inodes that have an mmap mapping on an inode. Otherwise > >> application will be able to modify previously faulted-in > >> file pages. > > > > All you need to do is lock out page faults once the page is clean; > > that's what the sb_start_pagefault() calls are for in the page fault > > path - they catch write faults and block them until the filesystem > > is unfrozen. Hence I'm not sure why this would be necessary if you > > are catching write faults in .pfn_mkwrite.... > > > > Jan pointed it out and he was right I have a test for this. What > happens is that since we had a mapping from before the freeze we will > not have a page-fault. And the buffers will be modified. > > As Jan explained in the cache path we do a writeback which turns > all pages to read-only. But with dax we do not have writeback > so the pages stay read-write mapped. Something needs to loop > through the pages and write-protect them. I chose to unmap > them because it is the much-much smaller code, and I do not care > to optimize the freeze. Then we have wider problem with DAX, then: sync doesn't work properly. i.e. if we still has write mapped pages, then we haven't flushed dirty cache lines on write-mapped files to the persistent domain by the time sync completes. So, this shouldn't be some special case that only the freeze code takes into account - we need to make sure that sync (and therefore freeze) flushes all dirty cache lines and marks all mappings clean.... Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pd0-f175.google.com (mail-pd0-f175.google.com [209.85.192.175]) by kanga.kvack.org (Postfix) with ESMTP id 865056B0038 for ; Tue, 24 Mar 2015 22:26:38 -0400 (EDT) Received: by pdbcz9 with SMTP id cz9so12552503pdb.3 for ; Tue, 24 Mar 2015 19:26:38 -0700 (PDT) Received: from ipmail06.adl2.internode.on.net (ipmail06.adl2.internode.on.net. [150.101.137.129]) by mx.google.com with ESMTP id xt6si1447568pbc.59.2015.03.24.19.26.36 for ; Tue, 24 Mar 2015 19:26:37 -0700 (PDT) Date: Wed, 25 Mar 2015 13:26:33 +1100 From: Dave Chinner Subject: Re: [PATCH 3/3] RFC: dax: dax_prepare_freeze Message-ID: <20150325022633.GB31342@dastard> References: <55100B78.501@plexistor.com> <55100D10.6090902@plexistor.com> <55115A99.40705@plexistor.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <55115A99.40705@plexistor.com> Sender: owner-linux-mm@kvack.org List-ID: To: Boaz Harrosh Cc: Matthew Wilcox , Andrew Morton , "Kirill A. Shutemov" , Jan Kara , Hugh Dickins , Mel Gorman , linux-mm@kvack.org, linux-nvdimm , linux-fsdevel , Eryu Guan On Tue, Mar 24, 2015 at 02:37:45PM +0200, Boaz Harrosh wrote: > On 03/23/2015 02:54 PM, Boaz Harrosh wrote: > > From: Boaz Harrosh > > > > When freezing an FS, we must write protect all IS_DAX() > > inodes that have an mmap mapping on an inode. Otherwise > > application will be able to modify previously faulted-in > > file pages. > > > > I'm actually doing a full unmap_mapping_range because > > there is no readily available "mapping_write_protect" like > > functionality. I do not think it is worth it to define one > > just for here and just for some extra read-faults after an > > fs_freeze. > > > > How hot-path is fs_freeze at all? > > > > OK So reinspecting this was a complete raw RFC. I need to do > more work on this thing > > comments below ... > > > CC: Jan Kara > > CC: Matthew Wilcox > > CC: Andrew Morton > > Signed-off-by: Boaz Harrosh > > --- > > fs/dax.c | 30 ++++++++++++++++++++++++++++++ > > fs/super.c | 3 +++ > > include/linux/fs.h | 1 + > > 3 files changed, 34 insertions(+) > > > > diff --git a/fs/dax.c b/fs/dax.c > > index d0bd1f4..f3fc28b 100644 > > --- a/fs/dax.c > > +++ b/fs/dax.c > > @@ -549,3 +549,33 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block) > > return dax_zero_page_range(inode, from, length, get_block); > > } > > EXPORT_SYMBOL_GPL(dax_truncate_page); > > + > > +/* This is meant to be called as part of freeze_super. otherwise we might > > + * Need some extra locking before calling here. > > + */ > > +void dax_prepare_freeze(struct super_block *sb) > > +{ > > + struct inode *inode; > > + > > + /* TODO: each DAX fs has some private mount option to enable DAX. If > > + * We made that option a generic MS_DAX_ENABLE super_block flag we could > > + * Avoid the 95% extra unneeded loop-on-all-inodes every freeze. > > + * if (!(sb->s_flags & MS_DAX_ENABLE)) > > + * return 0; > > + */ > > + > > + list_for_each_entry(inode, &sb->s_inodes, i_sb_list) { missing locking. > > + /* TODO: For freezing we can actually do with write-protecting > > + * the page. But I cannot find a ready made function that does > > + * that for a giving mapping (with all the proper locking). > > + * How performance sensitive is the all sb_freeze API? > > + * For now we can just unmap the all mapping, and pay extra > > + * on read faults. > > + */ > > + /* NOTE: Do not unmap private COW mapped pages it will not > > + * modify the FS. > > + */ > > + if (IS_DAX(inode)) > > + unmap_mapping_range(inode->i_mapping, 0, 0, 0); > > So what happens here is that we loop on all sb->s_inodes every freeze > and in the not DAX case just do nothing. Which is real bad and known to be a performance issue. See Josef's recent sync scalability patchset posting that only tracks and walks dirty inodes... > It could be nice to have a flag at the sb level to tel us if we need > to expect IS_DAX() inodes at all, for example when we are mounted on > an harddisk it should not be set. > > All of ext2/4 and now Dave's xfs have their own > XFS_MOUNT_DAX / EXT2_MOUNT_DAX / EXT4_MOUNT_DAX > > Is it OK if I unify all this on sb->s_flags |= MS_MOUNT_DAX so I can check it > here in Generic code? The option parsing will be done by each FS but > the flag be global? No, because as I mentioned in another thread we're going to end up with filesystems that don't have "mount wide" DAX behaviour, and we have to check every dirty inode anyway. And.... > > diff --git a/fs/super.c b/fs/super.c > > index 2b7dc90..9ef490c 100644 > > --- a/fs/super.c > > +++ b/fs/super.c > > @@ -1329,6 +1329,9 @@ int freeze_super(struct super_block *sb) > > /* All writers are done so after syncing there won't be dirty data */ > > sync_filesystem(sb); > > > > + /* Need to take care of DAX mmaped inodes */ > > + dax_prepare_freeze(sb); > > + > > So if CONFIG_FS_DAX is not set this will not compile I need to > define an empty one if not set ... it's the wrong approach - sync_filesystem(sb) shoul dbe handling this problem, so that sync and fsync work correctly, and then you don't care about whether DAX is supported or not... Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wg0-f52.google.com (mail-wg0-f52.google.com [74.125.82.52]) by kanga.kvack.org (Postfix) with ESMTP id 26EC56B0038 for ; Wed, 25 Mar 2015 04:10:36 -0400 (EDT) Received: by wgbcc7 with SMTP id cc7so17964368wgb.0 for ; Wed, 25 Mar 2015 01:10:35 -0700 (PDT) Received: from mail-wi0-f182.google.com (mail-wi0-f182.google.com. [209.85.212.182]) by mx.google.com with ESMTPS id it3si3716015wid.46.2015.03.25.01.10.33 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 25 Mar 2015 01:10:34 -0700 (PDT) Received: by wibg7 with SMTP id g7so99602001wib.1 for ; Wed, 25 Mar 2015 01:10:33 -0700 (PDT) Message-ID: <55126D77.7040105@plexistor.com> Date: Wed, 25 Mar 2015 10:10:31 +0200 From: Boaz Harrosh MIME-Version: 1.0 Subject: Re: [PATCH 3/3] RFC: dax: dax_prepare_freeze References: <55100B78.501@plexistor.com> <55100D10.6090902@plexistor.com> <20150323224047.GQ28621@dastard> <551100E3.9010007@plexistor.com> <20150325022221.GA31342@dastard> In-Reply-To: <20150325022221.GA31342@dastard> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Dave Chinner Cc: Matthew Wilcox , Andrew Morton , "Kirill A. Shutemov" , Jan Kara , Hugh Dickins , Mel Gorman , linux-mm@kvack.org, linux-nvdimm , linux-fsdevel , Eryu Guan On 03/25/2015 04:22 AM, Dave Chinner wrote: > On Tue, Mar 24, 2015 at 08:14:59AM +0200, Boaz Harrosh wrote: <> > > Then we have wider problem with DAX, then: sync doesn't work > properly. i.e. if we still has write mapped pages, then we haven't > flushed dirty cache lines on write-mapped files to the persistent > domain by the time sync completes. > > So, this shouldn't be some special case that only the freeze code > takes into account - we need to make sure that sync (and therefore > freeze) flushes all dirty cache lines and marks all mappings > clean.... > This is not how I understood it and how I read the code. The sync does happen, .fsync of the FS is called on each file just as if the user called it. If this is broken it just needs to be fixed there at the .fsync vector. POSIX mandate persistence at .fsync so at the vfs layer we rely on that. So everything at this stage should be synced to real media. What does not happen is writeback. since dax does not have any writeback. And because of that nothing turned the user mappings to read only. This is what I do here but instead of write-protecting I just unmap because it is easier for me to code it. > Cheers, > Dave. Cheers Boaz -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f175.google.com (mail-wi0-f175.google.com [209.85.212.175]) by kanga.kvack.org (Postfix) with ESMTP id C069B6B0038 for ; Wed, 25 Mar 2015 04:31:28 -0400 (EDT) Received: by wixw10 with SMTP id w10so68773874wix.0 for ; Wed, 25 Mar 2015 01:31:28 -0700 (PDT) Received: from mail-wg0-f45.google.com (mail-wg0-f45.google.com. [74.125.82.45]) by mx.google.com with ESMTPS id p12si3026079wjr.195.2015.03.25.01.31.25 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 25 Mar 2015 01:31:25 -0700 (PDT) Received: by wgra20 with SMTP id a20so18411321wgr.3 for ; Wed, 25 Mar 2015 01:31:25 -0700 (PDT) Message-ID: <5512725A.1010905@plexistor.com> Date: Wed, 25 Mar 2015 10:31:22 +0200 From: Boaz Harrosh MIME-Version: 1.0 Subject: Re: [PATCH 3/3] RFC: dax: dax_prepare_freeze References: <55100B78.501@plexistor.com> <55100D10.6090902@plexistor.com> <55115A99.40705@plexistor.com> <20150325022633.GB31342@dastard> In-Reply-To: <20150325022633.GB31342@dastard> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Dave Chinner Cc: Matthew Wilcox , Andrew Morton , "Kirill A. Shutemov" , Jan Kara , Hugh Dickins , Mel Gorman , linux-mm@kvack.org, linux-nvdimm , linux-fsdevel , Eryu Guan On 03/25/2015 04:26 AM, Dave Chinner wrote: <> >>> + /* TODO: each DAX fs has some private mount option to enable DAX. If >>> + * We made that option a generic MS_DAX_ENABLE super_block flag we could >>> + * Avoid the 95% extra unneeded loop-on-all-inodes every freeze. >>> + * if (!(sb->s_flags & MS_DAX_ENABLE)) >>> + * return 0; >>> + */ >>> + >>> + list_for_each_entry(inode, &sb->s_inodes, i_sb_list) { > > missing locking. > I will please need help here. This is very deep inside the freeze process we area already holding bunch of locks. We know that nothing can be modified at this stage. We are completely read-only. Only thing I can see that can happen is inode eviction do to oom. So do I need an iget. But how do I know the iget is allowed here? OK I do not have a clue what locks do I need, this deep in the freeze? >>> + /* TODO: For freezing we can actually do with write-protecting >>> + * the page. But I cannot find a ready made function that does >>> + * that for a giving mapping (with all the proper locking). >>> + * How performance sensitive is the all sb_freeze API? >>> + * For now we can just unmap the all mapping, and pay extra >>> + * on read faults. >>> + */ >>> + /* NOTE: Do not unmap private COW mapped pages it will not >>> + * modify the FS. >>> + */ >>> + if (IS_DAX(inode)) >>> + unmap_mapping_range(inode->i_mapping, 0, 0, 0); >> >> So what happens here is that we loop on all sb->s_inodes every freeze >> and in the not DAX case just do nothing. > > Which is real bad and known to be a performance issue. See Josef's > recent sync scalability patchset posting that only tracks and walks > dirty inodes... > Sure but how hot is freeze? Josef's fixed the very hot sync path, but freeze happens once in a blue moon. Do we care? >> It could be nice to have a flag at the sb level to tel us if we need >> to expect IS_DAX() inodes at all, for example when we are mounted on >> an harddisk it should not be set. >> >> All of ext2/4 and now Dave's xfs have their own >> XFS_MOUNT_DAX / EXT2_MOUNT_DAX / EXT4_MOUNT_DAX >> >> Is it OK if I unify all this on sb->s_flags |= MS_MOUNT_DAX so I can check it >> here in Generic code? The option parsing will be done by each FS but >> the flag be global? > > No, because as I mentioned in another thread we're going to end up > with filesystems that don't have "mount wide" DAX behaviour, and we > have to check every dirty inode anyway. And.... > Sure! but let us contract with the FS, that please set the MS_MOUNT_DAX if there is any chance at all that IS_DAX() comes out true, so we loop here. OK You know what, I will change this check to be: if (sb->s_bdev->bd_disk->fops->direct_access) BTW: We must loop this way on every sb inode because we do not have dirty inodes. There is no "dirty"ing going on in dax, not of inodes and not of pages. >>> diff --git a/fs/super.c b/fs/super.c >>> index 2b7dc90..9ef490c 100644 >>> --- a/fs/super.c >>> +++ b/fs/super.c >>> @@ -1329,6 +1329,9 @@ int freeze_super(struct super_block *sb) >>> /* All writers are done so after syncing there won't be dirty data */ >>> sync_filesystem(sb); >>> >>> + /* Need to take care of DAX mmaped inodes */ >>> + dax_prepare_freeze(sb); >>> + >> >> So if CONFIG_FS_DAX is not set this will not compile I need to >> define an empty one if not set > > ... it's the wrong approach - sync_filesystem(sb) shoul dbe handling > this problem, so that sync and fsync work correctly, and then you > don't care about whether DAX is supported or not... > sync and fsync should and will work correctly, but this does not solve our problem. because what turns pages to read-only is the writeback. And we do not have this in dax. Therefore we need to do this here as a special case. > Cheers, > Dave. > I have a new patchset with all this, I will send it once it is fully tested, I have problems testing both freeze and splice there are not any good tests that I could find that do what I want, so still working. Thanks Boaz -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f51.google.com (mail-pa0-f51.google.com [209.85.220.51]) by kanga.kvack.org (Postfix) with ESMTP id EA8006B0038 for ; Wed, 25 Mar 2015 05:29:39 -0400 (EDT) Received: by pabxg6 with SMTP id xg6so23258044pab.0 for ; Wed, 25 Mar 2015 02:29:39 -0700 (PDT) Received: from ipmail07.adl2.internode.on.net (ipmail07.adl2.internode.on.net. [150.101.137.131]) by mx.google.com with ESMTP id ok14si2916013pdb.2.2015.03.25.02.29.37 for ; Wed, 25 Mar 2015 02:29:38 -0700 (PDT) Date: Wed, 25 Mar 2015 20:29:22 +1100 From: Dave Chinner Subject: Re: [PATCH 3/3] RFC: dax: dax_prepare_freeze Message-ID: <20150325092922.GH31342@dastard> References: <55100B78.501@plexistor.com> <55100D10.6090902@plexistor.com> <20150323224047.GQ28621@dastard> <551100E3.9010007@plexistor.com> <20150325022221.GA31342@dastard> <55126D77.7040105@plexistor.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <55126D77.7040105@plexistor.com> Sender: owner-linux-mm@kvack.org List-ID: To: Boaz Harrosh Cc: Matthew Wilcox , Andrew Morton , "Kirill A. Shutemov" , Jan Kara , Hugh Dickins , Mel Gorman , linux-mm@kvack.org, linux-nvdimm , linux-fsdevel , Eryu Guan On Wed, Mar 25, 2015 at 10:10:31AM +0200, Boaz Harrosh wrote: > On 03/25/2015 04:22 AM, Dave Chinner wrote: > > On Tue, Mar 24, 2015 at 08:14:59AM +0200, Boaz Harrosh wrote: > <> > > > > Then we have wider problem with DAX, then: sync doesn't work > > properly. i.e. if we still has write mapped pages, then we haven't > > flushed dirty cache lines on write-mapped files to the persistent > > domain by the time sync completes. > > > > So, this shouldn't be some special case that only the freeze code > > takes into account - we need to make sure that sync (and therefore > > freeze) flushes all dirty cache lines and marks all mappings > > clean.... > > > > This is not how I understood it and how I read the code. > > The sync does happen, .fsync of the FS is called on each > file just as if the user called it. If this is broken it just > needs to be fixed there at the .fsync vector. POSIX mandate > persistence at .fsync so at the vfs layer we rely on that. right now, the filesystems will see that there are no dirty pages on the inode, and then just sync the inode metadata. They will do nothing else as filesystems are not aware of CPU cachelines at all. > So everything at this stage should be synced to real media. Actually no. This is what intel are introducing new CPU instructions for - so fsync can flush the cpu caches and commit them to th persistence domain correctly. > What does not happen is writeback. since dax does not have > any writeback. Which is precisely the problem we need to address - we don't need writeback to a block device, but we do need the dirty CPU cachelines flushed and the mappings cleaned. > And because of that nothing turned the > user mappings to read only. This is what I do here but > instead of write-protecting I just unmap because it is > easier for me to code it. That doesn't mean it is the correct solution. Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pd0-f179.google.com (mail-pd0-f179.google.com [209.85.192.179]) by kanga.kvack.org (Postfix) with ESMTP id DD5026B0038 for ; Wed, 25 Mar 2015 05:41:52 -0400 (EDT) Received: by pdbcz9 with SMTP id cz9so23054800pdb.3 for ; Wed, 25 Mar 2015 02:41:52 -0700 (PDT) Received: from ipmail07.adl2.internode.on.net (ipmail07.adl2.internode.on.net. [150.101.137.131]) by mx.google.com with ESMTP id yy3si2862538pbb.193.2015.03.25.02.41.50 for ; Wed, 25 Mar 2015 02:41:52 -0700 (PDT) Date: Wed, 25 Mar 2015 20:41:35 +1100 From: Dave Chinner Subject: Re: [PATCH 3/3] RFC: dax: dax_prepare_freeze Message-ID: <20150325094135.GI31342@dastard> References: <55100B78.501@plexistor.com> <55100D10.6090902@plexistor.com> <55115A99.40705@plexistor.com> <20150325022633.GB31342@dastard> <5512725A.1010905@plexistor.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5512725A.1010905@plexistor.com> Sender: owner-linux-mm@kvack.org List-ID: To: Boaz Harrosh Cc: Matthew Wilcox , Andrew Morton , "Kirill A. Shutemov" , Jan Kara , Hugh Dickins , Mel Gorman , linux-mm@kvack.org, linux-nvdimm , linux-fsdevel , Eryu Guan On Wed, Mar 25, 2015 at 10:31:22AM +0200, Boaz Harrosh wrote: > On 03/25/2015 04:26 AM, Dave Chinner wrote: > <> > >>> + /* TODO: each DAX fs has some private mount option to enable DAX. If > >>> + * We made that option a generic MS_DAX_ENABLE super_block flag we could > >>> + * Avoid the 95% extra unneeded loop-on-all-inodes every freeze. > >>> + * if (!(sb->s_flags & MS_DAX_ENABLE)) > >>> + * return 0; > >>> + */ > >>> + > >>> + list_for_each_entry(inode, &sb->s_inodes, i_sb_list) { > > > > missing locking. > > > > I will please need help here. This is very deep inside the freeze process > we area already holding bunch of locks. We know that nothing can be modified > at this stage. We are completely read-only. Which means we could stillbe reading new inodes in off disk and hence the sb->s_inodes list can be changing. Memory reclaim can be running via the shrinker, freeing clean inodes, hence the sb->s_inodes list can be changing. >>From fs/inode.c: /* * Inode locking rules: ..... * inode_sb_list_lock protects: * sb->s_inodes, inode->i_sb_list This... > >>> + /* TODO: For freezing we can actually do with write-protecting > >>> + * the page. But I cannot find a ready made function that does > >>> + * that for a giving mapping (with all the proper locking). > >>> + * How performance sensitive is the all sb_freeze API? > >>> + * For now we can just unmap the all mapping, and pay extra > >>> + * on read faults. > >>> + */ > >>> + /* NOTE: Do not unmap private COW mapped pages it will not > >>> + * modify the FS. > >>> + */ > >>> + if (IS_DAX(inode)) > >>> + unmap_mapping_range(inode->i_mapping, 0, 0, 0); > >> > >> So what happens here is that we loop on all sb->s_inodes every freeze > >> and in the not DAX case just do nothing. > > > > Which is real bad and known to be a performance issue. See Josef's > > recent sync scalability patchset posting that only tracks and walks > > dirty inodes... > > Sure but how hot is freeze? Josef's fixed the very hot sync path, > but freeze happens once in a blue moon. Do we care? Yes, because if you have 50 million cached inodes on a filesystem, it's going to take a long time to traverse them all, and right now the inode_sb_list_lock is a *global lock*. > >> It could be nice to have a flag at the sb level to tel us if we need > >> to expect IS_DAX() inodes at all, for example when we are mounted on > >> an harddisk it should not be set. > >> > >> All of ext2/4 and now Dave's xfs have their own > >> XFS_MOUNT_DAX / EXT2_MOUNT_DAX / EXT4_MOUNT_DAX > >> > >> Is it OK if I unify all this on sb->s_flags |= MS_MOUNT_DAX so I can check it > >> here in Generic code? The option parsing will be done by each FS but > >> the flag be global? > > > > No, because as I mentioned in another thread we're going to end up > > with filesystems that don't have "mount wide" DAX behaviour, and we > > have to check every dirty inode anyway. And.... > > > > Sure! but let us contract with the FS, that please set the MS_MOUNT_DAX > if there is any chance at all that IS_DAX() comes out true, so we loop > here. The mount option is irrelevant here - we should only be looping over dirty inodes. We don't care if they are DAX or not - we have to iterate them and ensure they are properly clean. We already have infrastructure to do this - we should use it and fix the problem once and for all rather than hacking special case code into random places. > BTW: We must loop this way on every sb inode because we do not have > dirty inodes. There is no "dirty"ing going on in dax, not of inodes > and not of pages. Precisely the problem we need to address. We do have dirty inodes, we just never set the fact they have dirty "pages" on them and hence never do data writeback on them. > > ... it's the wrong approach - sync_filesystem(sb) shoul dbe handling > > this problem, so that sync and fsync work correctly, and then you > > don't care about whether DAX is supported or not... > > > > sync and fsync should and will work correctly, but this does not > solve our problem. because what turns pages to read-only is the > writeback. And we do not have this in dax. Therefore we need to > do this here as a special case. We can still use exactly the same dirty tracking as we use for data writeback. The difference is that we don't need to go through all teh page writeback; we can just flush the CPU caches and mark all the mappings clean, then clear the I_DIRTY_PAGES flag and move on to inode writeback.... Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f171.google.com (mail-wi0-f171.google.com [209.85.212.171]) by kanga.kvack.org (Postfix) with ESMTP id AA1036B0038 for ; Wed, 25 Mar 2015 06:19:54 -0400 (EDT) Received: by wibg7 with SMTP id g7so103532235wib.1 for ; Wed, 25 Mar 2015 03:19:54 -0700 (PDT) Received: from mail-wi0-f172.google.com (mail-wi0-f172.google.com. [209.85.212.172]) by mx.google.com with ESMTPS id h9si3502497wjy.213.2015.03.25.03.19.52 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 25 Mar 2015 03:19:53 -0700 (PDT) Received: by wixw10 with SMTP id w10so31414889wix.0 for ; Wed, 25 Mar 2015 03:19:52 -0700 (PDT) Message-ID: <55128BC6.7090105@plexistor.com> Date: Wed, 25 Mar 2015 12:19:50 +0200 From: Boaz Harrosh MIME-Version: 1.0 Subject: Re: [PATCH 3/3] RFC: dax: dax_prepare_freeze References: <55100B78.501@plexistor.com> <55100D10.6090902@plexistor.com> <20150323224047.GQ28621@dastard> <551100E3.9010007@plexistor.com> <20150325022221.GA31342@dastard> <55126D77.7040105@plexistor.com> <20150325092922.GH31342@dastard> In-Reply-To: <20150325092922.GH31342@dastard> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Dave Chinner Cc: Matthew Wilcox , Andrew Morton , "Kirill A. Shutemov" , Jan Kara , Hugh Dickins , Mel Gorman , linux-mm@kvack.org, linux-nvdimm , linux-fsdevel , Eryu Guan On 03/25/2015 11:29 AM, Dave Chinner wrote: > On Wed, Mar 25, 2015 at 10:10:31AM +0200, Boaz Harrosh wrote: >> On 03/25/2015 04:22 AM, Dave Chinner wrote: >>> On Tue, Mar 24, 2015 at 08:14:59AM +0200, Boaz Harrosh wrote: >> <> <> >> The sync does happen, .fsync of the FS is called on each >> file just as if the user called it. If this is broken it just >> needs to be fixed there at the .fsync vector. POSIX mandate >> persistence at .fsync so at the vfs layer we rely on that. > > right now, the filesystems will see that there are no dirty pages > on the inode, and then just sync the inode metadata. They will do > nothing else as filesystems are not aware of CPU cachelines at all. > Sigh yes. There is this bug. And I am sitting on a wide fix for this. The strategy is. All Kernel writes are done with a new copy_user_nt. NT stands for none-temporal. This shows 20% improvements since cachelines need not be fetched when written too. The arches that do not have NT instructions, will use a generic copy_user_nt that does a copy_user and then flush cashes. Same flush cashes we do before DMA IO. (effectively every 4k) [Its more complicated with the edges and all, by I have solved all this. Will post in a week or two] So what is left is the mmaped inodes. The logic here is that at .fsync vector dax inodes will do a cl_flush only if mapping_mapped() is true. Also .msync is the same as .fsync And one last thing we also call .fsync at vm_operations_struct->close because it is allowed for an app to do mmap, munmap, .fsync so we just call dax .fsync at munmap always. So by now we should be covered for fsync guaranty. >> So everything at this stage should be synced to real media. > > Actually no. This is what intel are introducing new CPU instructions > for - so fsync can flush the cpu caches and commit them to th > persistence domain correctly. > The new intel instructions are for an optimization, and they will fit in the picture for the CPUs that have it. But there are already NT instructions for existing CPUs. (Just not as fast and precise) Every ARCH will do its best under a small API copy_user_nt - data is at media memset_nt - data is at media cl_flush - partial written cachelines flushed to media sfence - New data seen by all CPUs >> What does not happen is writeback. since dax does not have >> any writeback. > > Which is precisely the problem we need to address - we don't need > writeback to a block device, but we do need the dirty CPU cachelines > flushed and the mappings cleaned. > I see what you mean. Since nothing dirtied the inode then above .fsync will not be called and we have not pushed mmap data to media. Again here we only need to do this for mmaped inodes, because Kernel written data is (will be) written NT style. >> And because of that nothing turned the >> user mappings to read only. This is what I do here but >> instead of write-protecting I just unmap because it is >> easier for me to code it. > > That doesn't mean it is the correct solution. Please note that even if we properly .fsync cachlines the page-faults are orthogonal to this. There is no point in making mmapped dax pages read-only after every .fsync and pay a page-fault. We should leave them mapped has is. The only place that we need page protection is at freeze time. But I see that we might have a problem with .fsync not being called. I see that you sent a second mail. I'll try to answer there. > Cheers, > Dave. Thanks Boaz -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f182.google.com (mail-wi0-f182.google.com [209.85.212.182]) by kanga.kvack.org (Postfix) with ESMTP id 0934F6B0038 for ; Wed, 25 Mar 2015 06:40:50 -0400 (EDT) Received: by wibbg6 with SMTP id bg6so18023804wib.0 for ; Wed, 25 Mar 2015 03:40:49 -0700 (PDT) Received: from mail-wi0-f171.google.com (mail-wi0-f171.google.com. [209.85.212.171]) by mx.google.com with ESMTPS id ua5si3602506wjc.197.2015.03.25.03.40.47 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 25 Mar 2015 03:40:48 -0700 (PDT) Received: by wixw10 with SMTP id w10so32181048wix.0 for ; Wed, 25 Mar 2015 03:40:47 -0700 (PDT) Message-ID: <551290AC.7080402@plexistor.com> Date: Wed, 25 Mar 2015 12:40:44 +0200 From: Boaz Harrosh MIME-Version: 1.0 Subject: Re: [PATCH 3/3] RFC: dax: dax_prepare_freeze References: <55100B78.501@plexistor.com> <55100D10.6090902@plexistor.com> <55115A99.40705@plexistor.com> <20150325022633.GB31342@dastard> <5512725A.1010905@plexistor.com> <20150325094135.GI31342@dastard> In-Reply-To: <20150325094135.GI31342@dastard> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Dave Chinner , Boaz Harrosh Cc: Matthew Wilcox , Andrew Morton , "Kirill A. Shutemov" , Jan Kara , Hugh Dickins , Mel Gorman , linux-mm@kvack.org, linux-nvdimm , linux-fsdevel , Eryu Guan On 03/25/2015 11:41 AM, Dave Chinner wrote: > On Wed, Mar 25, 2015 at 10:31:22AM +0200, Boaz Harrosh wrote: >> On 03/25/2015 04:26 AM, Dave Chinner wrote: <> >> sync and fsync should and will work correctly, but this does not >> solve our problem. because what turns pages to read-only is the >> writeback. And we do not have this in dax. Therefore we need to >> do this here as a special case. > > We can still use exactly the same dirty tracking as we use for data > writeback. The difference is that we don't need to go through all > teh page writeback; we can just flush the CPU caches and mark all > the mappings clean, then clear the I_DIRTY_PAGES flag and move on to > inode writeback.... > I see what you mean. the sb wide sync will not step into mmaped inodes and fsync them. If we go my way and write NT (None Temporal) style in Kernel. NT instructions exist since xeon and all the Intel iX core CPUs have them. In tests we conducted doing xeon NT-writes vs regular-writes-and-cl_flush at .fsync showed minimum of 20% improvement. That is on very large IOs. On 4k IOs it was even better. It looks like you have a much better picture in your mind how to fit this properly at the inode-dirty picture. Can you attempt a rough draft? If we are going the NT way. Then we can only I_DIRTY_ track the mmaped inodes. For me this is really scary because I do not want to trigger any writeback threads. If you could please draw me an outline (or write something up ;-)) it would be great. > Cheers, > Dave. Thanks Boaz -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f46.google.com (mail-pa0-f46.google.com [209.85.220.46]) by kanga.kvack.org (Postfix) with ESMTP id E6F566B0032 for ; Wed, 25 Mar 2015 16:00:30 -0400 (EDT) Received: by pagj7 with SMTP id j7so39365191pag.2 for ; Wed, 25 Mar 2015 13:00:30 -0700 (PDT) Received: from ipmail05.adl6.internode.on.net (ipmail05.adl6.internode.on.net. [150.101.137.143]) by mx.google.com with ESMTP id to6si5105084pac.14.2015.03.25.13.00.28 for ; Wed, 25 Mar 2015 13:00:29 -0700 (PDT) Date: Thu, 26 Mar 2015 07:00:24 +1100 From: Dave Chinner Subject: Re: [PATCH 3/3] RFC: dax: dax_prepare_freeze Message-ID: <20150325200024.GJ31342@dastard> References: <55100B78.501@plexistor.com> <55100D10.6090902@plexistor.com> <20150323224047.GQ28621@dastard> <551100E3.9010007@plexistor.com> <20150325022221.GA31342@dastard> <55126D77.7040105@plexistor.com> <20150325092922.GH31342@dastard> <55128BC6.7090105@plexistor.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <55128BC6.7090105@plexistor.com> Sender: owner-linux-mm@kvack.org List-ID: To: Boaz Harrosh Cc: Matthew Wilcox , Andrew Morton , "Kirill A. Shutemov" , Jan Kara , Hugh Dickins , Mel Gorman , linux-mm@kvack.org, linux-nvdimm , linux-fsdevel , Eryu Guan On Wed, Mar 25, 2015 at 12:19:50PM +0200, Boaz Harrosh wrote: > On 03/25/2015 11:29 AM, Dave Chinner wrote: > > On Wed, Mar 25, 2015 at 10:10:31AM +0200, Boaz Harrosh wrote: > >> On 03/25/2015 04:22 AM, Dave Chinner wrote: > >>> On Tue, Mar 24, 2015 at 08:14:59AM +0200, Boaz Harrosh wrote: > >> <> > <> > >> The sync does happen, .fsync of the FS is called on each > >> file just as if the user called it. If this is broken it just > >> needs to be fixed there at the .fsync vector. POSIX mandate > >> persistence at .fsync so at the vfs layer we rely on that. > > > > right now, the filesystems will see that there are no dirty pages > > on the inode, and then just sync the inode metadata. They will do > > nothing else as filesystems are not aware of CPU cachelines at all. > > > > Sigh yes. There is this bug. And I am sitting on a wide fix for this. > > The strategy is. All Kernel writes are done with a new copy_user_nt. > NT stands for none-temporal. This shows 20% improvements since cachelines > need not be fetched when written too. That's unenforcable for mmap writes from userspace. And those are the writes that will trigger the dirty write mapping problem. > >> And because of that nothing turned the > >> user mappings to read only. This is what I do here but > >> instead of write-protecting I just unmap because it is > >> easier for me to code it. > > > > That doesn't mean it is the correct solution. > > Please note that even if we properly .fsync cachlines the page-faults > are orthogonal to this. There is no point in making mmapped dax pages > read-only after every .fsync and pay a page-fault. We should leave them > mapped has is. The only place that we need page protection is at freeze > time. Actually, current behaviour of filesystems is that fsync cleans all the pages in the range, and means all the mappings are marked read-only and so we get new calls into .page_mkwrite when write faults occur. We need that .page_mkwrite call to be able to a) update the mtime of the inode, and b) mark the inode "data dirty" so that fsync knows it needs to do something.... Hence I'd much prefer we start with identical behaviour to normal files, then we can optimise from a sane start point when write page faults show up as a performance problem. i.e. Correctness first, performance second. Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pd0-f174.google.com (mail-pd0-f174.google.com [209.85.192.174]) by kanga.kvack.org (Postfix) with ESMTP id 303906B0032 for ; Wed, 25 Mar 2015 16:05:21 -0400 (EDT) Received: by pdbcz9 with SMTP id cz9so38773235pdb.3 for ; Wed, 25 Mar 2015 13:05:20 -0700 (PDT) Received: from ipmail05.adl6.internode.on.net (ipmail05.adl6.internode.on.net. [150.101.137.143]) by mx.google.com with ESMTP id ce13si5058612pdb.224.2015.03.25.13.05.19 for ; Wed, 25 Mar 2015 13:05:20 -0700 (PDT) Date: Thu, 26 Mar 2015 07:05:16 +1100 From: Dave Chinner Subject: Re: [PATCH 3/3] RFC: dax: dax_prepare_freeze Message-ID: <20150325200516.GK31342@dastard> References: <55100B78.501@plexistor.com> <55100D10.6090902@plexistor.com> <55115A99.40705@plexistor.com> <20150325022633.GB31342@dastard> <5512725A.1010905@plexistor.com> <20150325094135.GI31342@dastard> <551290AC.7080402@plexistor.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <551290AC.7080402@plexistor.com> Sender: owner-linux-mm@kvack.org List-ID: To: Boaz Harrosh Cc: Matthew Wilcox , Andrew Morton , "Kirill A. Shutemov" , Jan Kara , Hugh Dickins , Mel Gorman , linux-mm@kvack.org, linux-nvdimm , linux-fsdevel , Eryu Guan On Wed, Mar 25, 2015 at 12:40:44PM +0200, Boaz Harrosh wrote: > On 03/25/2015 11:41 AM, Dave Chinner wrote: > > On Wed, Mar 25, 2015 at 10:31:22AM +0200, Boaz Harrosh wrote: > >> On 03/25/2015 04:26 AM, Dave Chinner wrote: > <> > >> sync and fsync should and will work correctly, but this does not > >> solve our problem. because what turns pages to read-only is the > >> writeback. And we do not have this in dax. Therefore we need to > >> do this here as a special case. > > > > We can still use exactly the same dirty tracking as we use for data > > writeback. The difference is that we don't need to go through all > > teh page writeback; we can just flush the CPU caches and mark all > > the mappings clean, then clear the I_DIRTY_PAGES flag and move on to > > inode writeback.... > > > > I see what you mean. the sb wide sync will not step into mmaped inodes > and fsync them. > > If we go my way and write NT (None Temporal) style in Kernel. > NT instructions exist since xeon and all the Intel iX core CPUs have > them. In tests we conducted doing xeon NT-writes vs > regular-writes-and-cl_flush at .fsync showed minimum of 20% improvement. > That is on very large IOs. On 4k IOs it was even better. As I said before, relying on specific instructions is a non-starter for mmap writes, and that's the problem we need to solve here. > It looks like you have a much better picture in your mind how to > fit this properly at the inode-dirty picture. Can you attempt a rough draft? > > If we are going the NT way. Then we can only I_DIRTY_ track the mmaped > inodes. For me this is really scary because I do not want to trigger > any writeback threads. If you could please draw me an outline (or write > something up ;-)) it would be great. Writeback threads are not used for fsync - they are used for sync and background writeback. They are already active on DAX filesystems that track dirty inodes on the VFS superblock, as this is the way inodes are written back on some filesystems. Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f174.google.com (mail-wi0-f174.google.com [209.85.212.174]) by kanga.kvack.org (Postfix) with ESMTP id 82F996B006E for ; Thu, 26 Mar 2015 04:02:13 -0400 (EDT) Received: by wibg7 with SMTP id g7so10417896wib.1 for ; Thu, 26 Mar 2015 01:02:13 -0700 (PDT) Received: from mail-wi0-f178.google.com (mail-wi0-f178.google.com. [209.85.212.178]) by mx.google.com with ESMTPS id p12si8554546wjr.195.2015.03.26.01.02.11 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 26 Mar 2015 01:02:12 -0700 (PDT) Received: by wiaa2 with SMTP id a2so10576627wia.0 for ; Thu, 26 Mar 2015 01:02:11 -0700 (PDT) Message-ID: <5513BD01.5080603@plexistor.com> Date: Thu, 26 Mar 2015 10:02:09 +0200 From: Boaz Harrosh MIME-Version: 1.0 Subject: Re: [PATCH 3/3] RFC: dax: dax_prepare_freeze References: <55100B78.501@plexistor.com> <55100D10.6090902@plexistor.com> <20150323224047.GQ28621@dastard> <551100E3.9010007@plexistor.com> <20150325022221.GA31342@dastard> <55126D77.7040105@plexistor.com> <20150325092922.GH31342@dastard> <55128BC6.7090105@plexistor.com> <20150325200024.GJ31342@dastard> In-Reply-To: <20150325200024.GJ31342@dastard> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Dave Chinner Cc: Matthew Wilcox , Andrew Morton , "Kirill A. Shutemov" , Jan Kara , Hugh Dickins , Mel Gorman , linux-mm@kvack.org, linux-nvdimm , linux-fsdevel , Eryu Guan On 03/25/2015 10:00 PM, Dave Chinner wrote: > On Wed, Mar 25, 2015 at 12:19:50PM +0200, Boaz Harrosh wrote: >> On 03/25/2015 11:29 AM, Dave Chinner wrote: >>> On Wed, Mar 25, 2015 at 10:10:31AM +0200, Boaz Harrosh wrote: >>>> On 03/25/2015 04:22 AM, Dave Chinner wrote: >>>>> On Tue, Mar 24, 2015 at 08:14:59AM +0200, Boaz Harrosh wrote: >>>> <> >> <> >>>> The sync does happen, .fsync of the FS is called on each >>>> file just as if the user called it. If this is broken it just >>>> needs to be fixed there at the .fsync vector. POSIX mandate >>>> persistence at .fsync so at the vfs layer we rely on that. >>> >>> right now, the filesystems will see that there are no dirty pages >>> on the inode, and then just sync the inode metadata. They will do >>> nothing else as filesystems are not aware of CPU cachelines at all. >>> >> >> Sigh yes. There is this bug. And I am sitting on a wide fix for this. >> >> The strategy is. All Kernel writes are done with a new copy_user_nt. >> NT stands for none-temporal. This shows 20% improvements since cachelines >> need not be fetched when written too. > > That's unenforcable for mmap writes from userspace. And those are > the writes that will trigger the dirty write mapping problem. > So for them I was thinking of just doing the .fsync on every unmap (ie vm_operations_struct->close) So now we know that only inodes that have an active vm mapping are in need of sync. >>>> And because of that nothing turned the >>>> user mappings to read only. This is what I do here but >>>> instead of write-protecting I just unmap because it is >>>> easier for me to code it. >>> >>> That doesn't mean it is the correct solution. >> >> Please note that even if we properly .fsync cachlines the page-faults >> are orthogonal to this. There is no point in making mmapped dax pages >> read-only after every .fsync and pay a page-fault. We should leave them >> mapped has is. The only place that we need page protection is at freeze >> time. > > Actually, current behaviour of filesystems is that fsync cleans all > the pages in the range, and means all the mappings are marked > read-only and so we get new calls into .page_mkwrite when write > faults occur. We need that .page_mkwrite call to be able to a) > update the mtime of the inode, and b) mark the inode "data dirty" so > that fsync knows it needs to do something.... > > Hence I'd much prefer we start with identical behaviour to normal > files, then we can optimise from a sane start point when write page > faults show up as a performance problem. i.e. Correctness first, > performance second. > OK, (you see when you speak slow I understand fast ;-)). I agree then I'll see if I can schedule some time for this. My boss will be very angry with me about this. But I will need help please, and some hands holding. Unless someone else volunteers to work on this ? > Cheers, > Dave. > Thanks Boaz -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f43.google.com (mail-pa0-f43.google.com [209.85.220.43]) by kanga.kvack.org (Postfix) with ESMTP id 24AC46B0032 for ; Thu, 26 Mar 2015 16:58:24 -0400 (EDT) Received: by pabxg6 with SMTP id xg6so74140454pab.0 for ; Thu, 26 Mar 2015 13:58:23 -0700 (PDT) Received: from ipmail07.adl2.internode.on.net (ipmail07.adl2.internode.on.net. [150.101.137.131]) by mx.google.com with ESMTP id sv6si9913860pbc.7.2015.03.26.13.58.21 for ; Thu, 26 Mar 2015 13:58:22 -0700 (PDT) Date: Fri, 27 Mar 2015 07:58:05 +1100 From: Dave Chinner Subject: Re: [PATCH 3/3] RFC: dax: dax_prepare_freeze Message-ID: <20150326205805.GC28129@dastard> References: <55100B78.501@plexistor.com> <55100D10.6090902@plexistor.com> <20150323224047.GQ28621@dastard> <551100E3.9010007@plexistor.com> <20150325022221.GA31342@dastard> <55126D77.7040105@plexistor.com> <20150325092922.GH31342@dastard> <55128BC6.7090105@plexistor.com> <20150325200024.GJ31342@dastard> <5513BD01.5080603@plexistor.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5513BD01.5080603@plexistor.com> Sender: owner-linux-mm@kvack.org List-ID: To: Boaz Harrosh Cc: Matthew Wilcox , Andrew Morton , "Kirill A. Shutemov" , Jan Kara , Hugh Dickins , Mel Gorman , linux-mm@kvack.org, linux-nvdimm , linux-fsdevel , Eryu Guan On Thu, Mar 26, 2015 at 10:02:09AM +0200, Boaz Harrosh wrote: > On 03/25/2015 10:00 PM, Dave Chinner wrote: > > On Wed, Mar 25, 2015 at 12:19:50PM +0200, Boaz Harrosh wrote: > >> On 03/25/2015 11:29 AM, Dave Chinner wrote: > >>> On Wed, Mar 25, 2015 at 10:10:31AM +0200, Boaz Harrosh wrote: > >>>> On 03/25/2015 04:22 AM, Dave Chinner wrote: > >>>>> On Tue, Mar 24, 2015 at 08:14:59AM +0200, Boaz Harrosh wrote: > >>>> <> > >> <> > >>>> The sync does happen, .fsync of the FS is called on each > >>>> file just as if the user called it. If this is broken it just > >>>> needs to be fixed there at the .fsync vector. POSIX mandate > >>>> persistence at .fsync so at the vfs layer we rely on that. > >>> > >>> right now, the filesystems will see that there are no dirty pages > >>> on the inode, and then just sync the inode metadata. They will do > >>> nothing else as filesystems are not aware of CPU cachelines at all. > >>> > >> > >> Sigh yes. There is this bug. And I am sitting on a wide fix for this. > >> > >> The strategy is. All Kernel writes are done with a new copy_user_nt. > >> NT stands for none-temporal. This shows 20% improvements since cachelines > >> need not be fetched when written too. > > > > That's unenforcable for mmap writes from userspace. And those are > > the writes that will trigger the dirty write mapping problem. > > > > So for them I was thinking of just doing the .fsync on every > unmap (ie vm_operations_struct->close) That is not necessary, I think - it can be handled by the background writeback thread just fine. > So now we know that only inodes that have an active vm mapping > are in need of sync. Easy enough. > >> Please note that even if we properly .fsync cachlines the page-faults > >> are orthogonal to this. There is no point in making mmapped dax pages > >> read-only after every .fsync and pay a page-fault. We should leave them > >> mapped has is. The only place that we need page protection is at freeze > >> time. > > > > Actually, current behaviour of filesystems is that fsync cleans all > > the pages in the range, and means all the mappings are marked > > read-only and so we get new calls into .page_mkwrite when write > > faults occur. We need that .page_mkwrite call to be able to a) > > update the mtime of the inode, and b) mark the inode "data dirty" so > > that fsync knows it needs to do something.... > > > > Hence I'd much prefer we start with identical behaviour to normal > > files, then we can optimise from a sane start point when write page > > faults show up as a performance problem. i.e. Correctness first, > > performance second. > > OK, (you see when you speak slow I understand fast ;-)). I agree then > I'll see if I can schedule some time for this. My boss will be very > angry with me about this. But I will need help please, and some hands > holding. Unless someone else volunteers to work on this ? It's not hard - you should be able to make somethign work from the untested, uncompiled skeleton below.... Cheers, Dave. -- Dave Chinner david@fromorbit.com dax: hack in dirty mapping tracking to fsync/sync/writeback Not-signed-off-by: Dave Chinner --- fs/dax.c | 27 ++++++++++++++++++++++++++- fs/xfs/xfs_file.c | 2 ++ mm/page-writeback.c | 5 +++++ 3 files changed, 33 insertions(+), 1 deletion(-) diff --git a/fs/dax.c b/fs/dax.c index 0121f7d..61cbd76 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -27,6 +27,29 @@ #include #include +/* + * flush the mapping to the persistent domain within the byte range of (start, + * end). This is required by data integrity operations to ensure file data is on + * persistent storage prior to completion of the operation. It also requires us + * to clean the mappings (i.e. write -> RO) so that we'll get a new fault when + * the file is written to again so wehave an indication that we need to flush + * the mapping if a data integrity operation takes place. + * + * We don't need commits to storage here - the filesystems will issue flushes + * appropriately at the conclusion of the data integrity operation via REQ_FUA + * writes or blkdev_issue_flush() commands. This requires the DAX block device + * to implement persistent storage domain fencing/commits on receiving a + * REQ_FLUSH or REQ_FUA request so that this works as expected by the higher + * layers. + */ +int dax_flush_mapping(struct address_space *mapping, loff_t start, loff_t end) +{ + /* XXX: make ptes in range clean */ + + /* XXX: flush CPU caches */ + return 0; +} + int dax_clear_blocks(struct inode *inode, sector_t block, long size) { struct block_device *bdev = inode->i_sb->s_bdev; @@ -472,8 +495,10 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, file_update_time(vma->vm_file); } result = __dax_fault(vma, vmf, get_block, complete_unwritten); - if (vmf->flags & FAULT_FLAG_WRITE) + if (vmf->flags & FAULT_FLAG_WRITE) { + __mark_inode_dirty(file_inode(vma->vm_file, I_DIRTY_PAGES); sb_end_pagefault(sb); + } return result; } diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 8017175..43e6c8e 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -1453,6 +1453,8 @@ xfs_filemap_page_mkwrite( } xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED); + if (IS_DAX(inode)) + __mark_inode_dirty(file_inode(vma->vm_file, I_DIRTY_PAGES); sb_end_pagefault(inode->i_sb); return ret; diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 45e187b..aa2fa76 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -2029,6 +2029,11 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc) if (wbc->nr_to_write <= 0) return 0; + + if (wbc->sync == WB_SYNC_ALL && IS_DAX(mapping->host)) + return dax_flush_mapping(mapping, wbc->range_start, + wbc->range_end); + if (mapping->a_ops->writepages) ret = mapping->a_ops->writepages(mapping, wbc); else -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org