* [PATCH] ext4: introduce per-inode DAX flag @ 2017-08-02 16:09 Lukas Czerner 2017-08-05 8:46 ` Christoph Hellwig 0 siblings, 1 reply; 24+ messages in thread From: Lukas Czerner @ 2017-08-02 16:09 UTC (permalink / raw) To: linux-ext4; +Cc: Lukas Czerner Currently it's only possible to enable, or disable DAX via mount option on ext4. However some applications may only want to enable DAX for certain files on a file system. This patch introduces new user visible and user modifiable ext4 inode flag EXT4_DAX_FL. This flag is also added into EXT4_FL_INHERITED which means that new inodes created in a directory with EXT4_DAX_FL set will automatically inherit the inode flag. With this patch ext4 is on par with xfs when it comes to per inode DAX enablement. Note that there are some changes due to this patch to consider. Regural users are now able to set up their files to use DAX which was previously only possible by mounting the file system with -o dax option. Any IO done to the file with the flag se will fail in case the block device is not DAX enabled. Signed-off-by: Lukas Czerner <lczerner@redhat.com> --- fs/ext4/ext4.h | 10 ++++++---- fs/ext4/inode.c | 6 +++--- fs/ext4/ioctl.c | 7 ++++++- 3 files changed, 15 insertions(+), 8 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 9ebde0c..689d003 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -395,14 +395,15 @@ struct flex_groups { #define EXT4_TOPDIR_FL 0x00020000 /* Top of directory hierarchies*/ #define EXT4_HUGE_FILE_FL 0x00040000 /* Set to each huge file */ #define EXT4_EXTENTS_FL 0x00080000 /* Inode uses extents */ +#define EXT4_DAX_FL 0x00100000 /* Use DAX for IO */ #define EXT4_EA_INODE_FL 0x00200000 /* Inode used for large EA */ #define EXT4_EOFBLOCKS_FL 0x00400000 /* Blocks allocated beyond EOF */ #define EXT4_INLINE_DATA_FL 0x10000000 /* Inode has inline data. */ #define EXT4_PROJINHERIT_FL 0x20000000 /* Create with parents projid */ #define EXT4_RESERVED_FL 0x80000000 /* reserved for ext4 lib */ -#define EXT4_FL_USER_VISIBLE 0x304BDFFF /* User visible flags */ -#define EXT4_FL_USER_MODIFIABLE 0x204BC0FF /* User modifiable flags */ +#define EXT4_FL_USER_VISIBLE 0x305BDFFF /* User visible flags */ +#define EXT4_FL_USER_MODIFIABLE 0x205BC0FF /* User modifiable flags */ /* Flags we can manipulate with through EXT4_IOC_FSSETXATTR */ #define EXT4_FL_XFLAG_VISIBLE (EXT4_SYNC_FL | \ @@ -410,14 +411,15 @@ struct flex_groups { EXT4_APPEND_FL | \ EXT4_NODUMP_FL | \ EXT4_NOATIME_FL | \ - EXT4_PROJINHERIT_FL) + EXT4_PROJINHERIT_FL |\ + EXT4_DAX_FL) /* Flags that should be inherited by new inodes from their parent. */ #define EXT4_FL_INHERITED (EXT4_SECRM_FL | EXT4_UNRM_FL | EXT4_COMPR_FL |\ EXT4_SYNC_FL | EXT4_NODUMP_FL | EXT4_NOATIME_FL |\ EXT4_NOCOMPR_FL | EXT4_JOURNAL_DATA_FL |\ EXT4_NOTAIL_FL | EXT4_DIRSYNC_FL |\ - EXT4_PROJINHERIT_FL) + EXT4_PROJINHERIT_FL | EXT4_DAX_FL) /* Flags that are appropriate for regular files (all but dir-specific ones). */ #define EXT4_REG_FLMASK (~(EXT4_DIRSYNC_FL | EXT4_TOPDIR_FL)) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 3c600f0..051c4db5 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -4555,9 +4555,9 @@ void ext4_set_inode_flags(struct inode *inode) new_fl |= S_NOATIME; if (flags & EXT4_DIRSYNC_FL) new_fl |= S_DIRSYNC; - if (test_opt(inode->i_sb, DAX) && S_ISREG(inode->i_mode) && - !ext4_should_journal_data(inode) && !ext4_has_inline_data(inode) && - !ext4_encrypted_inode(inode)) + if ((test_opt(inode->i_sb, DAX) || flags & EXT4_DAX_FL) && + S_ISREG(inode->i_mode) && !ext4_should_journal_data(inode) && + !ext4_has_inline_data(inode) && !ext4_encrypted_inode(inode)) new_fl |= S_DAX; inode_set_flags(inode, new_fl, S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC|S_DAX); diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c index 42b3a73..a46b606 100644 --- a/fs/ext4/ioctl.c +++ b/fs/ext4/ioctl.c @@ -424,12 +424,15 @@ static inline __u32 ext4_iflags_to_xflags(unsigned long iflags) xflags |= FS_XFLAG_NOATIME; if (iflags & EXT4_PROJINHERIT_FL) xflags |= FS_XFLAG_PROJINHERIT; + if (iflags & EXT4_DAX_FL) + xflags |= FS_XFLAG_DAX; return xflags; } #define EXT4_SUPPORTED_FS_XFLAGS (FS_XFLAG_SYNC | FS_XFLAG_IMMUTABLE | \ FS_XFLAG_APPEND | FS_XFLAG_NODUMP | \ - FS_XFLAG_NOATIME | FS_XFLAG_PROJINHERIT) + FS_XFLAG_NOATIME | FS_XFLAG_PROJINHERIT | \ + FS_XFLAG_DAX) /* Transfer xflags flags to internal */ static inline unsigned long ext4_xflags_to_iflags(__u32 xflags) @@ -448,6 +451,8 @@ static inline unsigned long ext4_xflags_to_iflags(__u32 xflags) iflags |= EXT4_NOATIME_FL; if (xflags & FS_XFLAG_PROJINHERIT) iflags |= EXT4_PROJINHERIT_FL; + if (xflags & FS_XFLAG_DAX) + iflags |= EXT4_DAX_FL; return iflags; } -- 2.7.5 ^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH] ext4: introduce per-inode DAX flag 2017-08-02 16:09 [PATCH] ext4: introduce per-inode DAX flag Lukas Czerner @ 2017-08-05 8:46 ` Christoph Hellwig 2017-08-07 12:12 ` Lukas Czerner 0 siblings, 1 reply; 24+ messages in thread From: Christoph Hellwig @ 2017-08-05 8:46 UTC (permalink / raw) To: Lukas Czerner; +Cc: linux-ext4 Please don't do this until we have a coherent DAX plan. So far the flag in XFS seems like a massive mistake and we need to get rid of it instead of adding it to more file systems. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] ext4: introduce per-inode DAX flag 2017-08-05 8:46 ` Christoph Hellwig @ 2017-08-07 12:12 ` Lukas Czerner 2017-08-08 9:00 ` Lukas Czerner 0 siblings, 1 reply; 24+ messages in thread From: Lukas Czerner @ 2017-08-07 12:12 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-ext4 On Sat, Aug 05, 2017 at 01:46:01AM -0700, Christoph Hellwig wrote: > Please don't do this until we have a coherent DAX plan. So far > the flag in XFS seems like a massive mistake and we need to get > rid of it instead of adding it to more file systems. Hi, pardon my ignorance, but can you point me to the discussion where this was decided ? I must have missed it and I can't find it. Thanks! -Lukas ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] ext4: introduce per-inode DAX flag 2017-08-07 12:12 ` Lukas Czerner @ 2017-08-08 9:00 ` Lukas Czerner 2017-08-11 10:01 ` Christoph Hellwig 0 siblings, 1 reply; 24+ messages in thread From: Lukas Czerner @ 2017-08-08 9:00 UTC (permalink / raw) To: Christoph Hellwig, linux-ext4 On Mon, Aug 07, 2017 at 02:12:12PM +0200, Lukas Czerner wrote: > On Sat, Aug 05, 2017 at 01:46:01AM -0700, Christoph Hellwig wrote: > > Please don't do this until we have a coherent DAX plan. So far > > the flag in XFS seems like a massive mistake and we need to get > > rid of it instead of adding it to more file systems. > > Hi, pardon my ignorance, but can you point me to the discussion where > this was decided ? I must have missed it and I can't find it. > > Thanks! > -Lukas So I've read the thread where you talk about this. I do not see anything about it being a masive mistake. I only see that you're confident that's it's a wrong thing to do. Like you're ever not confident about something;) Care to elaborate on what's wrong with it and why we do not want it ? On the surface it does look like a usefull option to have. -Lukas ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] ext4: introduce per-inode DAX flag 2017-08-08 9:00 ` Lukas Czerner @ 2017-08-11 10:01 ` Christoph Hellwig 2017-08-11 12:11 ` Lukas Czerner 0 siblings, 1 reply; 24+ messages in thread From: Christoph Hellwig @ 2017-08-11 10:01 UTC (permalink / raw) To: Christoph Hellwig, linux-ext4 On Tue, Aug 08, 2017 at 11:00:16AM +0200, Lukas Czerner wrote: > So I've read the thread where you talk about this. I do not see anything > about it being a masive mistake. I only see that you're confident that's > it's a wrong thing to do. Like you're ever not confident about > something;) Care to elaborate on what's wrong with it and why we do not > want it ? On the surface it does look like a usefull option to have. The problem is that DAX is an implementation detail on how to write data back. It has absolutely no user visible semantics. Encoding such a detail in the on-disk format is not a good idea. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] ext4: introduce per-inode DAX flag 2017-08-11 10:01 ` Christoph Hellwig @ 2017-08-11 12:11 ` Lukas Czerner 2017-08-11 12:58 ` Christoph Hellwig 0 siblings, 1 reply; 24+ messages in thread From: Lukas Czerner @ 2017-08-11 12:11 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-ext4 On Fri, Aug 11, 2017 at 03:01:47AM -0700, Christoph Hellwig wrote: > On Tue, Aug 08, 2017 at 11:00:16AM +0200, Lukas Czerner wrote: > > So I've read the thread where you talk about this. I do not see anything > > about it being a masive mistake. I only see that you're confident that's > > it's a wrong thing to do. Like you're ever not confident about > > something;) Care to elaborate on what's wrong with it and why we do not > > want it ? On the surface it does look like a usefull option to have. > > The problem is that DAX is an implementation detail on how to write > data back. It has absolutely no user visible semantics. Encoding > such a detail in the on-disk format is not a good idea. Thanks, for the answer. I do not know too much about the DAX enabled HW. However I do know that there is some variety to it, some can be faster than DRAM, some can be slower, or on-par with DRAM. Some can be more expensive, hence probably smaller, some cheaper and bigger. What about latency and throughput can there be difference as well ? That said, it seems to me that there can be some user choice involved in this at least based on the fact that when DAX is used system memory is not used. So for example when DAX HW is slower than system memory, user can make a choice to exclude some inodes to speed up particular workload, while saving system memory where it does not matter as much. Also can this flag play a role in situation of hierarchical, or heterogeneous storage where dax enabled hardware is used ? Seems to me that it might. What I am trying to say is that while you say that it has no user visible semantics, it does have effect on the system that should not be simply ignored. -Lukas ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] ext4: introduce per-inode DAX flag 2017-08-11 12:11 ` Lukas Czerner @ 2017-08-11 12:58 ` Christoph Hellwig 2017-08-11 13:41 ` Lukas Czerner 0 siblings, 1 reply; 24+ messages in thread From: Christoph Hellwig @ 2017-08-11 12:58 UTC (permalink / raw) To: Christoph Hellwig, linux-ext4 On Fri, Aug 11, 2017 at 02:11:32PM +0200, Lukas Czerner wrote: > Thanks, for the answer. I do not know too much about the DAX enabled HW. > However I do know that there is some variety to it, some can be faster > than DRAM, some can be slower, or on-par with DRAM. Some can be more > expensive, hence probably smaller, some cheaper and bigger. What about > latency and throughput can there be difference as well ? Or it could be similar for reads and much slower for writes. Or it might prefer larger access sizes than a cache line. > That said, it seems to me that there can be some user choice involved in > this at least based on the fact that when DAX is used system memory is not > used. All of the above are charateristics of the medium, not of the application. > So for example when DAX HW is slower than system memory, user can make > a choice to exclude some inodes to speed up particular workload, while > saving system memory where it does not matter as much. What should be factored into the decision might be access hints from the applications, which would be things like madvise/fadvise hints. But the application should not even have to know about something like DAX, and the detailed access characterisics of the medium. And even more importantly we should not encode those complex characterisitcs (see above) into a binary flag in the on-disk format. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] ext4: introduce per-inode DAX flag 2017-08-11 12:58 ` Christoph Hellwig @ 2017-08-11 13:41 ` Lukas Czerner 2017-08-24 18:20 ` Theodore Ts'o 0 siblings, 1 reply; 24+ messages in thread From: Lukas Czerner @ 2017-08-11 13:41 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-ext4 On Fri, Aug 11, 2017 at 05:58:49AM -0700, Christoph Hellwig wrote: > On Fri, Aug 11, 2017 at 02:11:32PM +0200, Lukas Czerner wrote: > > Thanks, for the answer. I do not know too much about the DAX enabled HW. > > However I do know that there is some variety to it, some can be faster > > than DRAM, some can be slower, or on-par with DRAM. Some can be more > > expensive, hence probably smaller, some cheaper and bigger. What about > > latency and throughput can there be difference as well ? > > Or it could be similar for reads and much slower for writes. Or it > might prefer larger access sizes than a cache line. > > > That said, it seems to me that there can be some user choice involved in > > this at least based on the fact that when DAX is used system memory is not > > used. > > All of the above are charateristics of the medium, not of the > application. > > > > So for example when DAX HW is slower than system memory, user can make > > a choice to exclude some inodes to speed up particular workload, while > > saving system memory where it does not matter as much. > > What should be factored into the decision might be access hints from > the applications, which would be things like madvise/fadvise hints. > > But the application should not even have to know about something like > DAX, and the detailed access characterisics of the medium. And even > more importantly we should not encode those complex characterisitcs > (see above) into a binary flag in the on-disk format. I understand your concept, I even agree that application should not even know, or have to know, about DAX, or any characteristics of the medium. Application can take advantage of madvise/fadvise hints, if implemented in this case and it will be usefull. But I am not talking about the application here. Administrator is the one setting up the environment and no matter what are the capabilities of the application he (it) can and should have the information about the the system (and the medium characteristics) and should be able to tweak it. Per inode DAX flags give him the ability to do so permanently in a convenient way in this case. We have plenty of knobs in the file system we can play with to optimize it for a given medium already. Sometimes we maybe go too far, but I do not think it's the case here. Especially since I do not see a way to implement the above mentioned use case in a different way - without relying on the application to do the right thing with hints (when those are implemented). -Lukas ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] ext4: introduce per-inode DAX flag 2017-08-11 13:41 ` Lukas Czerner @ 2017-08-24 18:20 ` Theodore Ts'o 2017-08-25 7:54 ` Christoph Hellwig 0 siblings, 1 reply; 24+ messages in thread From: Theodore Ts'o @ 2017-08-24 18:20 UTC (permalink / raw) To: Christoph Hellwig, linux-ext4; +Cc: Lukas Czerner On Fri, Aug 11, 2017 at 03:41:30PM +0200, Lukas Czerner wrote: > > > That said, it seems to me that there can be some user choice involved in > > > this at least based on the fact that when DAX is used system memory is not > > > used. > > > > All of the above are charateristics of the medium, not of the > > application. So it seems that Cristoph's primary object to using a per-inode DAX flag is that it is a not-very-well-defined hint to the file system about how to treat writes for a class of storage devices which do not yet have (and perhaps may never have) a standard set of performance characteristics. So if we encode this into the file system format, we'll be stuck with a "do something different" set of semantics that xfs (and ext4 if we pick up this patch) will have to support forever. Or, at least, if we make changes that cause performance impact to userspace applications, this may cause application programmers to kvetch --- not that they never done *that* before. :-) The counter-argument is that system administrators do need to have a way to signal that they would like the file system to "do something different" on a per-file basis, and no one else has come up with another way of doing things. Furthermore, it would be highly desirable if the system adminisator can provide this per-file system hint with requiring changes to the application. (For example, by adding madvise/fadvise hints.) Is that a fair summary of the argument? I have two additional questions I'd like to ask at this point. 1) Has there been any other difficulty that XFS has had due to the fact that they have this DAX flag added? e.g., are there any operational, or practical code maintainability issues at stake here? Or is this mostly an design philosophy debate? 2) Are there any users using the DAX flag with XFS such that, if XFS were to remove the DAX flag support, those users would complain bitterly? Thanks, - Ted ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] ext4: introduce per-inode DAX flag 2017-08-24 18:20 ` Theodore Ts'o @ 2017-08-25 7:54 ` Christoph Hellwig 2017-08-25 15:14 ` Theodore Ts'o 0 siblings, 1 reply; 24+ messages in thread From: Christoph Hellwig @ 2017-08-25 7:54 UTC (permalink / raw) To: Theodore Ts'o; +Cc: Christoph Hellwig, linux-ext4, Lukas Czerner, linux-xfs On Thu, Aug 24, 2017 at 02:20:57PM -0400, Theodore Ts'o wrote: > The counter-argument is that system administrators do need to have a > way to signal that they would like the file system to "do something > different" on a per-file basis, and no one else has come up with > another way of doing things. Furthermore, it would be highly > desirable if the system adminisator can provide this per-file system > hint with requiring changes to the application. (For example, by > adding madvise/fadvise hints.) We can always add some sort of inode or subtree advice. It's just the binary flag that encodes a specific implementation that is very bad in the long run. > Is that a fair summary of the argument? Otherwise yet. > I have two additional questions I'd like to ask at this point. > > 1) Has there been any other difficulty that XFS has had due to the > fact that they have this DAX flag added? e.g., are there any > operational, or practical code maintainability issues at stake here? > Or is this mostly an design philosophy debate? It hasn't yet. It will create really annoying problems once we use raw DAX access for metadata, which I had prototype a while ago and plan to finnally get in in the next months. > 2) Are there any users using the DAX flag with XFS such that, if XFS > were to remove the DAX flag support, those users would complain > bitterly? I don't know of anyone that actually uses the flag. If someone did that would probably run into problems like changing that changing it on a file that's currently mmaped would crash an burn badly. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] ext4: introduce per-inode DAX flag 2017-08-25 7:54 ` Christoph Hellwig @ 2017-08-25 15:14 ` Theodore Ts'o 2017-08-25 15:40 ` Christoph Hellwig 0 siblings, 1 reply; 24+ messages in thread From: Theodore Ts'o @ 2017-08-25 15:14 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-ext4, Lukas Czerner, linux-xfs On Fri, Aug 25, 2017 at 12:54:15AM -0700, Christoph Hellwig wrote: > > 1) Has there been any other difficulty that XFS has had due to the > > fact that they have this DAX flag added? e.g., are there any > > operational, or practical code maintainability issues at stake here? > > Or is this mostly an design philosophy debate? > > It hasn't yet. It will create really annoying problems once we > use raw DAX access for metadata, which I had prototype a while ago > and plan to finnally get in in the next months. > > > 2) Are there any users using the DAX flag with XFS such that, if XFS > > were to remove the DAX flag support, those users would complain > > bitterly? > > I don't know of anyone that actually uses the flag. If someone did > that would probably run into problems like changing that changing it > on a file that's currently mmaped would crash an burn badly. Crash and burn meaning the *kernel* will crash and burn? Or will data be damaged? Given that, maybe XFS should withdraw support for the DAX and hope no one is actually using it? - Ted ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] ext4: introduce per-inode DAX flag 2017-08-25 15:14 ` Theodore Ts'o @ 2017-08-25 15:40 ` Christoph Hellwig 2017-08-25 16:28 ` Theodore Ts'o 2017-08-25 23:33 ` Dave Chinner 0 siblings, 2 replies; 24+ messages in thread From: Christoph Hellwig @ 2017-08-25 15:40 UTC (permalink / raw) To: Theodore Ts'o; +Cc: Christoph Hellwig, linux-ext4, Lukas Czerner, linux-xfs On Fri, Aug 25, 2017 at 11:14:45AM -0400, Theodore Ts'o wrote: > > I don't know of anyone that actually uses the flag. If someone did > > that would probably run into problems like changing that changing it > > on a file that's currently mmaped would crash an burn badly. > > Crash and burn meaning the *kernel* will crash and burn? Or will data > be damaged? > > Given that, maybe XFS should withdraw support for the DAX and hope no > one is actually using it? Nah, -o dax works very well. It's just the flag instead of the -o dax option or rather switching it on a mapped file will probably be very dangerous. > > - Ted > > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ---end quoted text--- ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] ext4: introduce per-inode DAX flag 2017-08-25 15:40 ` Christoph Hellwig @ 2017-08-25 16:28 ` Theodore Ts'o 2017-08-25 23:33 ` Dave Chinner 1 sibling, 0 replies; 24+ messages in thread From: Theodore Ts'o @ 2017-08-25 16:28 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-ext4, Lukas Czerner, linux-xfs On Fri, Aug 25, 2017 at 08:40:32AM -0700, Christoph Hellwig wrote: > On Fri, Aug 25, 2017 at 11:14:45AM -0400, Theodore Ts'o wrote: > > > I don't know of anyone that actually uses the flag. If someone did > > > that would probably run into problems like changing that changing it > > > on a file that's currently mmaped would crash an burn badly. > > > > Crash and burn meaning the *kernel* will crash and burn? Or will data > > be damaged? > > > > Given that, maybe XFS should withdraw support for the DAX and hope no > > one is actually using it? > > Nah, -o dax works very well. It's just the flag instead of the -o dax > option or rather switching it on a mapped file will probably be very dangerous. Sorry, I was suggested XFS withdrawing support for the pre-file dax flag. I thought part of the argument for why ext4 should accept this patch is that XFS already has the per-file DAX flag? - Ted ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] ext4: introduce per-inode DAX flag 2017-08-25 15:40 ` Christoph Hellwig 2017-08-25 16:28 ` Theodore Ts'o @ 2017-08-25 23:33 ` Dave Chinner 2017-08-28 7:38 ` Christoph Hellwig 1 sibling, 1 reply; 24+ messages in thread From: Dave Chinner @ 2017-08-25 23:33 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Theodore Ts'o, linux-ext4, Lukas Czerner, linux-xfs On Fri, Aug 25, 2017 at 08:40:32AM -0700, Christoph Hellwig wrote: > On Fri, Aug 25, 2017 at 11:14:45AM -0400, Theodore Ts'o wrote: > > > I don't know of anyone that actually uses the flag. If someone did > > > that would probably run into problems like changing that changing it > > > on a file that's currently mmaped would crash an burn badly. > > > > Crash and burn meaning the *kernel* will crash and burn? Or will data > > be damaged? > > > > Given that, maybe XFS should withdraw support for the DAX and hope no > > one is actually using it? > > Nah, -o dax works very well. It's just the flag instead of the -o dax > option or rather switching it on a mapped file will probably be very dangerous. In what way is it dangerous, Christoph? Changing to/form DAX is simply a case of syncing the data and then invalidating the existing mappings and letting the new faults go direct or through the page cache. The mapping invalidation requirements are no different to what we need to do to punch a hole in a mapped file... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] ext4: introduce per-inode DAX flag 2017-08-25 23:33 ` Dave Chinner @ 2017-08-28 7:38 ` Christoph Hellwig 2017-08-28 10:10 ` Dave Chinner 0 siblings, 1 reply; 24+ messages in thread From: Christoph Hellwig @ 2017-08-28 7:38 UTC (permalink / raw) To: Dave Chinner Cc: Christoph Hellwig, Theodore Ts'o, linux-ext4, Lukas Czerner, linux-xfs On Sat, Aug 26, 2017 at 09:33:58AM +1000, Dave Chinner wrote: > > Nah, -o dax works very well. It's just the flag instead of the -o dax > > option or rather switching it on a mapped file will probably be very dangerous. > > In what way is it dangerous, Christoph? When I run the following script as a normal user: FSXDIR=~/xfstests/ltp/ FILE=/mnt/foo ${FSXDIR}/fsx $FILE & while true; do xfs_io -c 'chattr +x' $FILE xfs_io -c 'chattr -x' $FILE done I get this nice little crash: root@testvm:~# sh test.sh skipping zero size read skipping insert range behind EOF truncating to largest ever: 0x3a290 zero_range to largest ever: 0x3a8d1 zero_range to largest ever: 0x3fe3e zero_range to largest ever: 0x40000 [ 344.898390] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020 [ 344.899306] IP: iomap_page_mkwrite+0x17/0xf0 [ 344.899795] PGD 7db37067 [ 344.899796] P4D 7db37067 [ 344.900099] PUD 78c61067 [ 344.900389] PMD 0 [ 344.900665] [ 344.901075] Oops: 0000 [#1] SMP [ 344.901536] Modules linked in: [ 344.901716] CPU: 3 PID: 6052 Comm: fsx Not tainted 4.12.0+ #2199 [ 344.901716] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014 [ 344.901716] task: ffff880079a0da00 task.stack: ffffc900068a4000 [ 344.901716] RIP: 0010:iomap_page_mkwrite+0x17/0xf0 [ 344.901716] RSP: 0000:ffffc900068a7d38 EFLAGS: 00010246 [ 344.901716] RAX: ffff8800798dd0d0 RBX: 0000000000000200 RCX: 0000000000000001 [ 344.901716] RDX: 0000000070eb898e RSI: ffffffff82109010 RDI: ffffc900068a7df0 [ 344.901716] RBP: ffffc900068a7d60 R08: ffffffff82ff9fa8 R09: 0000000000000000 [ 344.901716] R10: ffffc900068a7cb0 R11: ffffffff8159b5cc R12: ffffffff82109010 [ 344.901716] R13: 0000000000000000 R14: ffffc900068a7df0 R15: ffff88007da89580 [ 344.901716] FS: 00007f76bc863700(0000) GS:ffff88007fd80000(0000) knlGS:0000000000000000 [ 344.901716] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 344.901716] CR2: 0000000000000020 CR3: 000000007d90b000 CR4: 00000000000006e0 [ 344.901716] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 344.901716] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 344.901716] Call Trace: [ 344.901716] xfs_filemap_page_mkwrite+0x90/0x1d0 [ 344.901716] xfs_filemap_fault+0x114/0x180 [ 344.901716] __do_fault+0x19/0x60 [ 344.901716] __handle_mm_fault+0x68f/0xaf0 [ 344.901716] handle_mm_fault+0x82/0x130 [ 344.901716] __do_page_fault+0x222/0x4d0 [ 344.901716] trace_do_page_fault+0x85/0x210 [ 344.901716] do_async_page_fault+0x14/0x60 [ 344.901716] async_page_fault+0x28/0x30 [ 344.901716] RIP: 0033:0x7f76bbd14049 [ 344.901716] RSP: 002b:00007ffcdd6c7fe8 EFLAGS: 00010206 [ 344.901716] RAX: 000000000002f144 RBX: 0000000000000fd9 RCX: 000000000000fd98 [ 344.901716] RDX: 0000000000007ecc RSI: 00007f76bc80efe9 RDI: 00007f76bc7d7fd9 [ 344.901716] RBP: 00007f76bc7d7000 R08: 0000000000000003 R09: 000000000002e000 [ 344.901716] R10: 0000000000000001 R11: 0000000000000206 R12: 0000000000007ecc [ 344.901716] R13: 000000000002efd9 R14: 0000000000008ea5 R15: 0000000000000000 [ 344.901716] Code: 50 ff ff ff 5d c3 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 49 89 f4 53 4c 8b 6f 48 48 8b 07 <49> 8b 55 20 48 8b 80 a0 00 00 00 4c 8b 70 20 48 8d 42 ff 83 e2 [ 344.901716] RIP: iomap_page_mkwrite+0x17/0xf0 RSP: ffffc900068a7d38 [ 344.901716] CR2: 0000000000000020 [ 344.924264] ---[ end trace c4e8d3bdccf6912b ]--- [ 344.924834] Kernel panic - not syncing: Fatal exception [ 344.925504] Kernel Offset: disabled [ 344.925884] ---[ end Kernel panic - not syncing: Fatal exception ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] ext4: introduce per-inode DAX flag 2017-08-28 7:38 ` Christoph Hellwig @ 2017-08-28 10:10 ` Dave Chinner 2017-08-29 15:49 ` Jan Kara 0 siblings, 1 reply; 24+ messages in thread From: Dave Chinner @ 2017-08-28 10:10 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Theodore Ts'o, linux-ext4, Lukas Czerner, linux-xfs On Mon, Aug 28, 2017 at 12:38:54AM -0700, Christoph Hellwig wrote: > On Sat, Aug 26, 2017 at 09:33:58AM +1000, Dave Chinner wrote: > > > Nah, -o dax works very well. It's just the flag instead of the -o dax > > > option or rather switching it on a mapped file will probably be very dangerous. > > > > In what way is it dangerous, Christoph? > > When I run the following script as a normal user: > > FSXDIR=~/xfstests/ltp/ > FILE=/mnt/foo > > ${FSXDIR}/fsx $FILE & > > while true; do > xfs_io -c 'chattr +x' $FILE > xfs_io -c 'chattr -x' $FILE > done > > I get this nice little crash: Can you please package that up into an xfstest? > root@testvm:~# sh test.sh > skipping zero size read > skipping insert range behind EOF > truncating to largest ever: 0x3a290 > zero_range to largest ever: 0x3a8d1 > zero_range to largest ever: 0x3fe3e > zero_range to largest ever: 0x40000 > [ 344.898390] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020 > [ 344.899306] IP: iomap_page_mkwrite+0x17/0xf0 > [ 344.899795] PGD 7db37067 > [ 344.899796] P4D 7db37067 > [ 344.900099] PUD 78c61067 > [ 344.900389] PMD 0 > [ 344.900665] > [ 344.901075] Oops: 0000 [#1] SMP > [ 344.901536] Modules linked in: > [ 344.901716] CPU: 3 PID: 6052 Comm: fsx Not tainted 4.12.0+ #2199 > [ 344.901716] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014 > [ 344.901716] task: ffff880079a0da00 task.stack: ffffc900068a4000 > [ 344.901716] RIP: 0010:iomap_page_mkwrite+0x17/0xf0 > [ 344.901716] RSP: 0000:ffffc900068a7d38 EFLAGS: 00010246 > [ 344.901716] RAX: ffff8800798dd0d0 RBX: 0000000000000200 RCX: 0000000000000001 > [ 344.901716] RDX: 0000000070eb898e RSI: ffffffff82109010 RDI: ffffc900068a7df0 > [ 344.901716] RBP: ffffc900068a7d60 R08: ffffffff82ff9fa8 R09: 0000000000000000 > [ 344.901716] R10: ffffc900068a7cb0 R11: ffffffff8159b5cc R12: ffffffff82109010 > [ 344.901716] R13: 0000000000000000 R14: ffffc900068a7df0 R15: ffff88007da89580 ^^^^^^^^^^^^^^^^ vmf->page is null. Which means IS_DAX changed half way through a fault, despite us holding the MMAPLOCK and protecting all the filesystem side of the fault code from races. Seems to me that even allowing filesystems to switch between different mapping tree behaviours based on an inode flag is a fundamentally broken model. The fault action that needs to taken by the filesystem has already been predetermined by the fault processing that has already occurred and placed into the contents of the vmf we've been passed. Hence I think that if we need to process the fault as a DAX fault then the vmf needs to tell us that, not require us to look up an inode flag to determine what to do. ANd if the inode flag changes, then that needs to be propagated through the mapping and VMAs in a sane fashion, not just run an invalidation from the filesystem. I don't know enough about the VM code to say anything useful about how this needs to be set up, but it's clear that mapping invalidation and behaviour swaps can't be completely serialised against page faults from the filesystem side. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] ext4: introduce per-inode DAX flag 2017-08-28 10:10 ` Dave Chinner @ 2017-08-29 15:49 ` Jan Kara 2017-08-29 22:57 ` Dave Chinner 0 siblings, 1 reply; 24+ messages in thread From: Jan Kara @ 2017-08-29 15:49 UTC (permalink / raw) To: Dave Chinner Cc: Christoph Hellwig, Theodore Ts'o, linux-ext4, Lukas Czerner, linux-xfs On Mon 28-08-17 20:10:14, Dave Chinner wrote: > On Mon, Aug 28, 2017 at 12:38:54AM -0700, Christoph Hellwig wrote: > > On Sat, Aug 26, 2017 at 09:33:58AM +1000, Dave Chinner wrote: > > > > Nah, -o dax works very well. It's just the flag instead of the -o dax > > > > option or rather switching it on a mapped file will probably be very dangerous. > > > > > > In what way is it dangerous, Christoph? > > > > When I run the following script as a normal user: > > > > FSXDIR=~/xfstests/ltp/ > > FILE=/mnt/foo > > > > ${FSXDIR}/fsx $FILE & > > > > while true; do > > xfs_io -c 'chattr +x' $FILE > > xfs_io -c 'chattr -x' $FILE > > done > > > > I get this nice little crash: > > Can you please package that up into an xfstest? > > > root@testvm:~# sh test.sh > > skipping zero size read > > skipping insert range behind EOF > > truncating to largest ever: 0x3a290 > > zero_range to largest ever: 0x3a8d1 > > zero_range to largest ever: 0x3fe3e > > zero_range to largest ever: 0x40000 > > [ 344.898390] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020 > > [ 344.899306] IP: iomap_page_mkwrite+0x17/0xf0 > > [ 344.899795] PGD 7db37067 > > [ 344.899796] P4D 7db37067 > > [ 344.900099] PUD 78c61067 > > [ 344.900389] PMD 0 > > [ 344.900665] > > [ 344.901075] Oops: 0000 [#1] SMP > > [ 344.901536] Modules linked in: > > [ 344.901716] CPU: 3 PID: 6052 Comm: fsx Not tainted 4.12.0+ #2199 > > [ 344.901716] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014 > > [ 344.901716] task: ffff880079a0da00 task.stack: ffffc900068a4000 > > [ 344.901716] RIP: 0010:iomap_page_mkwrite+0x17/0xf0 > > [ 344.901716] RSP: 0000:ffffc900068a7d38 EFLAGS: 00010246 > > [ 344.901716] RAX: ffff8800798dd0d0 RBX: 0000000000000200 RCX: 0000000000000001 > > [ 344.901716] RDX: 0000000070eb898e RSI: ffffffff82109010 RDI: ffffc900068a7df0 > > [ 344.901716] RBP: ffffc900068a7d60 R08: ffffffff82ff9fa8 R09: 0000000000000000 > > [ 344.901716] R10: ffffc900068a7cb0 R11: ffffffff8159b5cc R12: ffffffff82109010 > > [ 344.901716] R13: 0000000000000000 R14: ffffc900068a7df0 R15: ffff88007da89580 > ^^^^^^^^^^^^^^^^ > > vmf->page is null. > > Which means IS_DAX changed half way through a fault, despite us > holding the MMAPLOCK and protecting all the filesystem side of the > fault code from races. > > Seems to me that even allowing filesystems to switch between > different mapping tree behaviours based on an inode flag is a > fundamentally broken model. The fault action that needs to taken by > the filesystem has already been predetermined by the fault > processing that has already occurred and placed into the contents of > the vmf we've been passed. I don't think the problem is actually within MM in this particular case. The problem seems to be that xfs_filemap_fault() checks IS_DAX without holding MMAPLOCK and so it can change after that test and before the test in xfs_filemap_page_mkwrite(). > Hence I think that if we need to process the fault as a DAX fault > then the vmf needs to tell us that, not require us to look up an > inode flag to determine what to do. ANd if the inode flag changes, > then that needs to be propagated through the mapping and VMAs in a > sane fashion, not just run an invalidation from the filesystem. I > don't know enough about the VM code to say anything useful about how > this needs to be set up, but it's clear that mapping invalidation > and behaviour swaps can't be completely serialised against page > faults from the filesystem side. But there is no difference in vmf setup from generic MM side. In particular vmf->page is set by the ->fault handler and then it is passed to ->page_mkwrite handler. And changes to mapping behavior between these two callbacks should be prevented by the page lock / radix entry lock... Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] ext4: introduce per-inode DAX flag 2017-08-29 15:49 ` Jan Kara @ 2017-08-29 22:57 ` Dave Chinner 2017-08-30 10:00 ` Jan Kara 0 siblings, 1 reply; 24+ messages in thread From: Dave Chinner @ 2017-08-29 22:57 UTC (permalink / raw) To: Jan Kara Cc: Christoph Hellwig, Theodore Ts'o, linux-ext4, Lukas Czerner, linux-xfs On Tue, Aug 29, 2017 at 05:49:22PM +0200, Jan Kara wrote: > On Mon 28-08-17 20:10:14, Dave Chinner wrote: > > On Mon, Aug 28, 2017 at 12:38:54AM -0700, Christoph Hellwig wrote: > > > On Sat, Aug 26, 2017 at 09:33:58AM +1000, Dave Chinner wrote: > > > > > Nah, -o dax works very well. It's just the flag instead of the -o dax > > > > > option or rather switching it on a mapped file will probably be very dangerous. > > > > > > > > In what way is it dangerous, Christoph? > > > > > > When I run the following script as a normal user: > > > > > > FSXDIR=~/xfstests/ltp/ > > > FILE=/mnt/foo > > > > > > ${FSXDIR}/fsx $FILE & > > > > > > while true; do > > > xfs_io -c 'chattr +x' $FILE > > > xfs_io -c 'chattr -x' $FILE > > > done > > > > > > I get this nice little crash: > > > > Can you please package that up into an xfstest? > > > > > root@testvm:~# sh test.sh > > > skipping zero size read > > > skipping insert range behind EOF > > > truncating to largest ever: 0x3a290 > > > zero_range to largest ever: 0x3a8d1 > > > zero_range to largest ever: 0x3fe3e > > > zero_range to largest ever: 0x40000 > > > [ 344.898390] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020 > > > [ 344.899306] IP: iomap_page_mkwrite+0x17/0xf0 > > > [ 344.899795] PGD 7db37067 > > > [ 344.899796] P4D 7db37067 > > > [ 344.900099] PUD 78c61067 > > > [ 344.900389] PMD 0 > > > [ 344.900665] > > > [ 344.901075] Oops: 0000 [#1] SMP > > > [ 344.901536] Modules linked in: > > > [ 344.901716] CPU: 3 PID: 6052 Comm: fsx Not tainted 4.12.0+ #2199 > > > [ 344.901716] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014 > > > [ 344.901716] task: ffff880079a0da00 task.stack: ffffc900068a4000 > > > [ 344.901716] RIP: 0010:iomap_page_mkwrite+0x17/0xf0 > > > [ 344.901716] RSP: 0000:ffffc900068a7d38 EFLAGS: 00010246 > > > [ 344.901716] RAX: ffff8800798dd0d0 RBX: 0000000000000200 RCX: 0000000000000001 > > > [ 344.901716] RDX: 0000000070eb898e RSI: ffffffff82109010 RDI: ffffc900068a7df0 > > > [ 344.901716] RBP: ffffc900068a7d60 R08: ffffffff82ff9fa8 R09: 0000000000000000 > > > [ 344.901716] R10: ffffc900068a7cb0 R11: ffffffff8159b5cc R12: ffffffff82109010 > > > [ 344.901716] R13: 0000000000000000 R14: ffffc900068a7df0 R15: ffff88007da89580 > > ^^^^^^^^^^^^^^^^ > > > > vmf->page is null. > > > > Which means IS_DAX changed half way through a fault, despite us > > holding the MMAPLOCK and protecting all the filesystem side of the > > fault code from races. > > > > Seems to me that even allowing filesystems to switch between > > different mapping tree behaviours based on an inode flag is a > > fundamentally broken model. The fault action that needs to taken by > > the filesystem has already been predetermined by the fault > > processing that has already occurred and placed into the contents of > > the vmf we've been passed. > > I don't think the problem is actually within MM in this particular case. > The problem seems to be that xfs_filemap_fault() checks IS_DAX without > holding MMAPLOCK and so it can change after that test and before the test > in xfs_filemap_page_mkwrite(). First thing I did was fix that, and it still paniced with vmf->page = null in iomap_page_mkwrite, then I realised it was irrelevant because if it can change between xfs_filemap_fault() and xfs_filemap_page_mkwrite() it can change anytime in the fault path we are not holding the MMAPLOCK.... > > Hence I think that if we need to process the fault as a DAX fault > > then the vmf needs to tell us that, not require us to look up an > > inode flag to determine what to do. ANd if the inode flag changes, > > then that needs to be propagated through the mapping and VMAs in a > > sane fashion, not just run an invalidation from the filesystem. I > > don't know enough about the VM code to say anything useful about how > > this needs to be set up, but it's clear that mapping invalidation > > and behaviour swaps can't be completely serialised against page > > faults from the filesystem side. > > But there is no difference in vmf setup from generic MM side. In particular > vmf->page is set by the ->fault handler and then it is passed to > ->page_mkwrite handler. Yes there is. DAX can call the .fault() handler directly from handle_pte_fault() for write faults on DAX when there is no pte installed. In which case vmf->page is null because we don't go through do_wp_page() to install the page we faulted on in the vmf before calling .page_mkwrite(). IOWs, if we fault between the invalidation in the IS_DAX changing code and dropping the MMAPLOCK once the transaction inode change is complete, then that fault will find an empty pte. It will then call ->filemap_fault with vmf->page = NULL, and block on the MMAPLOCK until the filesystem transaction is complete. Then it checks IS_DAX, finds it's not set, so assumes that we have a page cache fault, not a DAX fault.... It appears to me that we can't prevent this invalidation vs fault race condition with filesystem-only locking and that means we can't use filesystem side configuration flags to determine what action to take in the filesystem side of the fault path. I have no idea what mm/ side lock is needed to prevent this invalidation vs fault race condition from occurring, nor whether it is even possible to take and hold that lock in the filesysetm code where we would need to. > And changes to mapping behavior between these two > callbacks should be prevented by the page lock / radix entry lock... No page, no page lock. We're above the mapping code, too, so no radix entry lock is held here, either... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] ext4: introduce per-inode DAX flag 2017-08-29 22:57 ` Dave Chinner @ 2017-08-30 10:00 ` Jan Kara 2017-08-30 12:34 ` Christoph Hellwig 2017-08-30 16:05 ` Jan Kara 0 siblings, 2 replies; 24+ messages in thread From: Jan Kara @ 2017-08-30 10:00 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, Christoph Hellwig, Theodore Ts'o, linux-ext4, Lukas Czerner, linux-xfs On Wed 30-08-17 08:57:17, Dave Chinner wrote: > On Tue, Aug 29, 2017 at 05:49:22PM +0200, Jan Kara wrote: > > On Mon 28-08-17 20:10:14, Dave Chinner wrote: > > > On Mon, Aug 28, 2017 at 12:38:54AM -0700, Christoph Hellwig wrote: > > > > On Sat, Aug 26, 2017 at 09:33:58AM +1000, Dave Chinner wrote: > > > > > > Nah, -o dax works very well. It's just the flag instead of the -o dax > > > > > > option or rather switching it on a mapped file will probably be very dangerous. > > > > > > > > > > In what way is it dangerous, Christoph? > > > > > > > > When I run the following script as a normal user: > > > > > > > > FSXDIR=~/xfstests/ltp/ > > > > FILE=/mnt/foo > > > > > > > > ${FSXDIR}/fsx $FILE & > > > > > > > > while true; do > > > > xfs_io -c 'chattr +x' $FILE > > > > xfs_io -c 'chattr -x' $FILE > > > > done > > > > > > > > I get this nice little crash: > > > > > > Can you please package that up into an xfstest? > > > > > > > root@testvm:~# sh test.sh > > > > skipping zero size read > > > > skipping insert range behind EOF > > > > truncating to largest ever: 0x3a290 > > > > zero_range to largest ever: 0x3a8d1 > > > > zero_range to largest ever: 0x3fe3e > > > > zero_range to largest ever: 0x40000 > > > > [ 344.898390] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020 > > > > [ 344.899306] IP: iomap_page_mkwrite+0x17/0xf0 > > > > [ 344.899795] PGD 7db37067 > > > > [ 344.899796] P4D 7db37067 > > > > [ 344.900099] PUD 78c61067 > > > > [ 344.900389] PMD 0 > > > > [ 344.900665] > > > > [ 344.901075] Oops: 0000 [#1] SMP > > > > [ 344.901536] Modules linked in: > > > > [ 344.901716] CPU: 3 PID: 6052 Comm: fsx Not tainted 4.12.0+ #2199 > > > > [ 344.901716] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014 > > > > [ 344.901716] task: ffff880079a0da00 task.stack: ffffc900068a4000 > > > > [ 344.901716] RIP: 0010:iomap_page_mkwrite+0x17/0xf0 > > > > [ 344.901716] RSP: 0000:ffffc900068a7d38 EFLAGS: 00010246 > > > > [ 344.901716] RAX: ffff8800798dd0d0 RBX: 0000000000000200 RCX: 0000000000000001 > > > > [ 344.901716] RDX: 0000000070eb898e RSI: ffffffff82109010 RDI: ffffc900068a7df0 > > > > [ 344.901716] RBP: ffffc900068a7d60 R08: ffffffff82ff9fa8 R09: 0000000000000000 > > > > [ 344.901716] R10: ffffc900068a7cb0 R11: ffffffff8159b5cc R12: ffffffff82109010 > > > > [ 344.901716] R13: 0000000000000000 R14: ffffc900068a7df0 R15: ffff88007da89580 > > > ^^^^^^^^^^^^^^^^ > > > > > > vmf->page is null. > > > > > > Which means IS_DAX changed half way through a fault, despite us > > > holding the MMAPLOCK and protecting all the filesystem side of the > > > fault code from races. > > > > > > Seems to me that even allowing filesystems to switch between > > > different mapping tree behaviours based on an inode flag is a > > > fundamentally broken model. The fault action that needs to taken by > > > the filesystem has already been predetermined by the fault > > > processing that has already occurred and placed into the contents of > > > the vmf we've been passed. > > > > I don't think the problem is actually within MM in this particular case. > > The problem seems to be that xfs_filemap_fault() checks IS_DAX without > > holding MMAPLOCK and so it can change after that test and before the test > > in xfs_filemap_page_mkwrite(). > > First thing I did was fix that, and it still paniced with vmf->page > = null in iomap_page_mkwrite, then I realised it was irrelevant OK, drat. > because if it can change between xfs_filemap_fault() and > xfs_filemap_page_mkwrite() it can change anytime in the fault path > we are not holding the MMAPLOCK.... Agreed but... > > > Hence I think that if we need to process the fault as a DAX fault > > > then the vmf needs to tell us that, not require us to look up an > > > inode flag to determine what to do. ANd if the inode flag changes, > > > then that needs to be propagated through the mapping and VMAs in a > > > sane fashion, not just run an invalidation from the filesystem. I > > > don't know enough about the VM code to say anything useful about how > > > this needs to be set up, but it's clear that mapping invalidation > > > and behaviour swaps can't be completely serialised against page > > > faults from the filesystem side. > > > > But there is no difference in vmf setup from generic MM side. In particular > > vmf->page is set by the ->fault handler and then it is passed to > > ->page_mkwrite handler. > > Yes there is. DAX can call the .fault() handler directly from > handle_pte_fault() for write faults on DAX when there is no pte > installed. In which case vmf->page is null because we don't go > through do_wp_page() to install the page we faulted on in the vmf > before calling .page_mkwrite(). > > IOWs, if we fault between the invalidation in the IS_DAX changing > code and dropping the MMAPLOCK once the transaction inode change is > complete, then that fault will find an empty pte. It will then call > ->filemap_fault with vmf->page = NULL, and block on the MMAPLOCK > until the filesystem transaction is complete. Then it checks > IS_DAX, finds it's not set, so assumes that we have a page cache > fault, not a DAX fault.... So if you get to xfs_filemap_fault(), lock MMAPLOCK, then check IS_DAX() - not set - then you go to filemap_fault() which is perfectly fine with vmf->page being NULL. What am I missing? The oops can happen if you call iomap_page_mkwrite() with vmf->page set to NULL. However I don't see how that is possible after all IS_DAX checks are under MMAPLOCK - iomap_page_mkwrite() cannot be called through xfs_filemap_fault() anymore once DAX checks are consistent. And if it gets called through xfs_filemap_page_mkwrite(), it means that xfs_filemap_fault() was called before and must have returned VM_FAULT_LOCKED which means that it has set vmf->page to a correct page. [sorry, the page lock and entry lock I was speaking about in my previous email was indeed wrong] I've tried to reproduce this on my machine but for some reason setting of per-inode DAX flag does not work. I have (with today's checkout of xfsprogs): marvin5:~/:[0]# cat /proc/mounts ... /dev/ram0 /mnt xfs rw,relatime,attr2,dax,inode64,noquota 0 0 marvin5:~/:[0]# ls /mnt/ fsxfile fsxfile.fsxgood fsxfile.fsxlog marvin5:~/:[0]# ./xfsprogs-dev/io/xfs_io -c 'lsattr' /mnt/fsxfile -p-------------- /mnt/fsxfile marvin5:~/:[0]# ./xfsprogs-dev/io/xfs_io -c 'chattr +x' /mnt/fsxfile marvin5:~/:[0]# ./xfsprogs-dev/io/xfs_io -c 'lsattr' /mnt/fsxfile -p-------------- /mnt/fsxfile No DAX flag set and no error... What am I doing wrong? Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] ext4: introduce per-inode DAX flag 2017-08-30 10:00 ` Jan Kara @ 2017-08-30 12:34 ` Christoph Hellwig 2017-08-30 15:00 ` Theodore Ts'o 2017-08-30 15:29 ` Jan Kara 2017-08-30 16:05 ` Jan Kara 1 sibling, 2 replies; 24+ messages in thread From: Christoph Hellwig @ 2017-08-30 12:34 UTC (permalink / raw) To: Jan Kara Cc: Dave Chinner, Christoph Hellwig, Theodore Ts'o, linux-ext4, Lukas Czerner, linux-xfs On Wed, Aug 30, 2017 at 12:00:59PM +0200, Jan Kara wrote: > /dev/ram0 /mnt xfs rw,relatime,attr2,dax,inode64,noquota 0 0 > > marvin5:~/:[0]# ls /mnt/ > fsxfile fsxfile.fsxgood fsxfile.fsxlog > > marvin5:~/:[0]# ./xfsprogs-dev/io/xfs_io -c 'lsattr' /mnt/fsxfile > -p-------------- /mnt/fsxfile > > marvin5:~/:[0]# ./xfsprogs-dev/io/xfs_io -c 'chattr +x' /mnt/fsxfile > > marvin5:~/:[0]# ./xfsprogs-dev/io/xfs_io -c 'lsattr' /mnt/fsxfile > -p-------------- /mnt/fsxfile > > No DAX flag set and no error... What am I doing wrong? Odd. Even on a non-DAX device the flag sticks for me: root@testvm:~# mount /dev/sda /mnt/ root@testvm:~# touch foo root@testvm:~# cd /mnt/ root@testvm:/mnt# xfs_io -c 'chattr +x' foo root@testvm:/mnt# xfs_io -c 'lsattr' foo --------------x- foo ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] ext4: introduce per-inode DAX flag 2017-08-30 12:34 ` Christoph Hellwig @ 2017-08-30 15:00 ` Theodore Ts'o 2017-08-30 15:30 ` Lukas Czerner 2017-08-30 15:29 ` Jan Kara 1 sibling, 1 reply; 24+ messages in thread From: Theodore Ts'o @ 2017-08-30 15:00 UTC (permalink / raw) To: Christoph Hellwig Cc: Jan Kara, Dave Chinner, linux-ext4, Lukas Czerner, linux-xfs Bringing this discussion back to the "ext4: introduce per-inode DAX flag" patch, it sounds to me that I should _not_ apply this patch until we figure out to fix this for reals. Does this sound fair? If we think it's completely unfixable, then maybe XFS should consider (temporarily) withdrawing the per-file DAX support before there is a userspace and you have to support it forever, but that's between the NVM/dax and XFS folks to figure out. - Ted ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] ext4: introduce per-inode DAX flag 2017-08-30 15:00 ` Theodore Ts'o @ 2017-08-30 15:30 ` Lukas Czerner 0 siblings, 0 replies; 24+ messages in thread From: Lukas Czerner @ 2017-08-30 15:30 UTC (permalink / raw) To: Theodore Ts'o Cc: Christoph Hellwig, Jan Kara, Dave Chinner, linux-ext4, linux-xfs On Wed, Aug 30, 2017 at 11:00:19AM -0400, Theodore Ts'o wrote: > Bringing this discussion back to the "ext4: introduce per-inode DAX > flag" patch, it sounds to me that I should _not_ apply this patch > until we figure out to fix this for reals. Does this sound fair? > > If we think it's completely unfixable, then maybe XFS should consider > (temporarily) withdrawing the per-file DAX support before there is a > userspace and you have to support it forever, but that's between the > NVM/dax and XFS folks to figure out. > > - Ted Hi Ted, I agree, this still has some issues even for ext4. Let's see what we can do about that and I'll revisit it later. Thanks! -Lukas ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] ext4: introduce per-inode DAX flag 2017-08-30 12:34 ` Christoph Hellwig 2017-08-30 15:00 ` Theodore Ts'o @ 2017-08-30 15:29 ` Jan Kara 1 sibling, 0 replies; 24+ messages in thread From: Jan Kara @ 2017-08-30 15:29 UTC (permalink / raw) To: Christoph Hellwig Cc: Jan Kara, Dave Chinner, Theodore Ts'o, linux-ext4, Lukas Czerner, linux-xfs On Wed 30-08-17 05:34:50, Christoph Hellwig wrote: > On Wed, Aug 30, 2017 at 12:00:59PM +0200, Jan Kara wrote: > > /dev/ram0 /mnt xfs rw,relatime,attr2,dax,inode64,noquota 0 0 > > > > marvin5:~/:[0]# ls /mnt/ > > fsxfile fsxfile.fsxgood fsxfile.fsxlog > > > > marvin5:~/:[0]# ./xfsprogs-dev/io/xfs_io -c 'lsattr' /mnt/fsxfile > > -p-------------- /mnt/fsxfile > > > > marvin5:~/:[0]# ./xfsprogs-dev/io/xfs_io -c 'chattr +x' /mnt/fsxfile > > > > marvin5:~/:[0]# ./xfsprogs-dev/io/xfs_io -c 'lsattr' /mnt/fsxfile > > -p-------------- /mnt/fsxfile > > > > No DAX flag set and no error... What am I doing wrong? > > Odd. Even on a non-DAX device the flag sticks for me: > > root@testvm:~# mount /dev/sda /mnt/ > root@testvm:~# touch foo > root@testvm:~# cd /mnt/ > root@testvm:/mnt# xfs_io -c 'chattr +x' foo > root@testvm:/mnt# xfs_io -c 'lsattr' foo > --------------x- foo OK, tracked this down. I had the XFS filesystem created so that ip->i_d.di_version < 3 check in xfs_set_diflags() was true and so the flag got silently ignored. I'd hope to get error in such case but whatever... Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] ext4: introduce per-inode DAX flag 2017-08-30 10:00 ` Jan Kara 2017-08-30 12:34 ` Christoph Hellwig @ 2017-08-30 16:05 ` Jan Kara 1 sibling, 0 replies; 24+ messages in thread From: Jan Kara @ 2017-08-30 16:05 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, Christoph Hellwig, Theodore Ts'o, linux-ext4, Lukas Czerner, linux-xfs On Wed 30-08-17 12:00:59, Jan Kara wrote: > On Wed 30-08-17 08:57:17, Dave Chinner wrote: > > On Tue, Aug 29, 2017 at 05:49:22PM +0200, Jan Kara wrote: > > > On Mon 28-08-17 20:10:14, Dave Chinner wrote: > > > > On Mon, Aug 28, 2017 at 12:38:54AM -0700, Christoph Hellwig wrote: > > > > > On Sat, Aug 26, 2017 at 09:33:58AM +1000, Dave Chinner wrote: > > > > > > > Nah, -o dax works very well. It's just the flag instead of the -o dax > > > > > > > option or rather switching it on a mapped file will probably be very dangerous. > > > > > > > > > > > > In what way is it dangerous, Christoph? > > > > > > > > > > When I run the following script as a normal user: > > > > > > > > > > FSXDIR=~/xfstests/ltp/ > > > > > FILE=/mnt/foo > > > > > > > > > > ${FSXDIR}/fsx $FILE & > > > > > > > > > > while true; do > > > > > xfs_io -c 'chattr +x' $FILE > > > > > xfs_io -c 'chattr -x' $FILE > > > > > done > > > > > > > > > > I get this nice little crash: > > > > > > > > Can you please package that up into an xfstest? > > > > > > > > > root@testvm:~# sh test.sh > > > > > skipping zero size read > > > > > skipping insert range behind EOF > > > > > truncating to largest ever: 0x3a290 > > > > > zero_range to largest ever: 0x3a8d1 > > > > > zero_range to largest ever: 0x3fe3e > > > > > zero_range to largest ever: 0x40000 > > > > > [ 344.898390] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020 > > > > > [ 344.899306] IP: iomap_page_mkwrite+0x17/0xf0 > > > > > [ 344.899795] PGD 7db37067 > > > > > [ 344.899796] P4D 7db37067 > > > > > [ 344.900099] PUD 78c61067 > > > > > [ 344.900389] PMD 0 > > > > > [ 344.900665] > > > > > [ 344.901075] Oops: 0000 [#1] SMP > > > > > [ 344.901536] Modules linked in: > > > > > [ 344.901716] CPU: 3 PID: 6052 Comm: fsx Not tainted 4.12.0+ #2199 > > > > > [ 344.901716] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014 > > > > > [ 344.901716] task: ffff880079a0da00 task.stack: ffffc900068a4000 > > > > > [ 344.901716] RIP: 0010:iomap_page_mkwrite+0x17/0xf0 > > > > > [ 344.901716] RSP: 0000:ffffc900068a7d38 EFLAGS: 00010246 > > > > > [ 344.901716] RAX: ffff8800798dd0d0 RBX: 0000000000000200 RCX: 0000000000000001 > > > > > [ 344.901716] RDX: 0000000070eb898e RSI: ffffffff82109010 RDI: ffffc900068a7df0 > > > > > [ 344.901716] RBP: ffffc900068a7d60 R08: ffffffff82ff9fa8 R09: 0000000000000000 > > > > > [ 344.901716] R10: ffffc900068a7cb0 R11: ffffffff8159b5cc R12: ffffffff82109010 > > > > > [ 344.901716] R13: 0000000000000000 R14: ffffc900068a7df0 R15: ffff88007da89580 > > > > ^^^^^^^^^^^^^^^^ > > > > > > > > vmf->page is null. > > > > > > > > Which means IS_DAX changed half way through a fault, despite us > > > > holding the MMAPLOCK and protecting all the filesystem side of the > > > > fault code from races. > > > > > > > > Seems to me that even allowing filesystems to switch between > > > > different mapping tree behaviours based on an inode flag is a > > > > fundamentally broken model. The fault action that needs to taken by > > > > the filesystem has already been predetermined by the fault > > > > processing that has already occurred and placed into the contents of > > > > the vmf we've been passed. > > > > > > I don't think the problem is actually within MM in this particular case. > > > The problem seems to be that xfs_filemap_fault() checks IS_DAX without > > > holding MMAPLOCK and so it can change after that test and before the test > > > in xfs_filemap_page_mkwrite(). > > > > First thing I did was fix that, and it still paniced with vmf->page > > = null in iomap_page_mkwrite, then I realised it was irrelevant > > OK, drat. > > > because if it can change between xfs_filemap_fault() and > > xfs_filemap_page_mkwrite() it can change anytime in the fault path > > we are not holding the MMAPLOCK.... > > Agreed but... > > > > > Hence I think that if we need to process the fault as a DAX fault > > > > then the vmf needs to tell us that, not require us to look up an > > > > inode flag to determine what to do. ANd if the inode flag changes, > > > > then that needs to be propagated through the mapping and VMAs in a > > > > sane fashion, not just run an invalidation from the filesystem. I > > > > don't know enough about the VM code to say anything useful about how > > > > this needs to be set up, but it's clear that mapping invalidation > > > > and behaviour swaps can't be completely serialised against page > > > > faults from the filesystem side. > > > > > > But there is no difference in vmf setup from generic MM side. In particular > > > vmf->page is set by the ->fault handler and then it is passed to > > > ->page_mkwrite handler. > > > > Yes there is. DAX can call the .fault() handler directly from > > handle_pte_fault() for write faults on DAX when there is no pte > > installed. In which case vmf->page is null because we don't go > > through do_wp_page() to install the page we faulted on in the vmf > > before calling .page_mkwrite(). > > > > IOWs, if we fault between the invalidation in the IS_DAX changing > > code and dropping the MMAPLOCK once the transaction inode change is > > complete, then that fault will find an empty pte. It will then call > > ->filemap_fault with vmf->page = NULL, and block on the MMAPLOCK > > until the filesystem transaction is complete. Then it checks > > IS_DAX, finds it's not set, so assumes that we have a page cache > > fault, not a DAX fault.... > > So if you get to xfs_filemap_fault(), lock MMAPLOCK, then check IS_DAX() - > not set - then you go to filemap_fault() which is perfectly fine with > vmf->page being NULL. What am I missing? > > The oops can happen if you call iomap_page_mkwrite() with vmf->page set > to NULL. However I don't see how that is possible after all IS_DAX checks > are under MMAPLOCK - iomap_page_mkwrite() cannot be called through > xfs_filemap_fault() anymore once DAX checks are consistent. And if it gets > called through xfs_filemap_page_mkwrite(), it means that > xfs_filemap_fault() was called before and must have returned > VM_FAULT_LOCKED which means that it has set vmf->page to a correct page. > [sorry, the page lock and entry lock I was speaking about in my previous > email was indeed wrong] > > I've tried to reproduce this on my machine but for some reason setting of > per-inode DAX flag does not work. I have (with today's checkout of > xfsprogs): > > marvin5:~/:[0]# cat /proc/mounts > ... > /dev/ram0 /mnt xfs rw,relatime,attr2,dax,inode64,noquota 0 0 > > marvin5:~/:[0]# ls /mnt/ > fsxfile fsxfile.fsxgood fsxfile.fsxlog > > marvin5:~/:[0]# ./xfsprogs-dev/io/xfs_io -c 'lsattr' /mnt/fsxfile > -p-------------- /mnt/fsxfile > > marvin5:~/:[0]# ./xfsprogs-dev/io/xfs_io -c 'chattr +x' /mnt/fsxfile > > marvin5:~/:[0]# ./xfsprogs-dev/io/xfs_io -c 'lsattr' /mnt/fsxfile > -p-------------- /mnt/fsxfile > > No DAX flag set and no error... What am I doing wrong? OK, after fixing this problem I could easily reproduce. And after some investigation I have to agree it is non-trivial to fix this. I have fixed races between .fault, .page_mkwrite, and flag change relatively easily. However also DAX VMAs need to be setup in a special way (VM_MIXEDMAP set) which obviously does not happen when mmap(2) is called on inode without S_DAX set and just invalidating the mapping does not change the VMAs. We could fix this by making sure file is not mapped when switching the flag but it gets a bit awkward... Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2017-08-30 16:05 UTC | newest] Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2017-08-02 16:09 [PATCH] ext4: introduce per-inode DAX flag Lukas Czerner 2017-08-05 8:46 ` Christoph Hellwig 2017-08-07 12:12 ` Lukas Czerner 2017-08-08 9:00 ` Lukas Czerner 2017-08-11 10:01 ` Christoph Hellwig 2017-08-11 12:11 ` Lukas Czerner 2017-08-11 12:58 ` Christoph Hellwig 2017-08-11 13:41 ` Lukas Czerner 2017-08-24 18:20 ` Theodore Ts'o 2017-08-25 7:54 ` Christoph Hellwig 2017-08-25 15:14 ` Theodore Ts'o 2017-08-25 15:40 ` Christoph Hellwig 2017-08-25 16:28 ` Theodore Ts'o 2017-08-25 23:33 ` Dave Chinner 2017-08-28 7:38 ` Christoph Hellwig 2017-08-28 10:10 ` Dave Chinner 2017-08-29 15:49 ` Jan Kara 2017-08-29 22:57 ` Dave Chinner 2017-08-30 10:00 ` Jan Kara 2017-08-30 12:34 ` Christoph Hellwig 2017-08-30 15:00 ` Theodore Ts'o 2017-08-30 15:30 ` Lukas Czerner 2017-08-30 15:29 ` Jan Kara 2017-08-30 16:05 ` Jan Kara
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.