* [PATCH v2 0/2] Filename wipeout patch series updates @ 2021-04-07 15:42 Leah Rumancik 2021-04-07 15:42 ` [PATCH v2 1/2] ext4: wipe filename upon file deletion Leah Rumancik 2021-04-07 15:42 ` [PATCH v2 2/2] ext4: add ioctl FS_IOC_CHKPT_JRNL Leah Rumancik 0 siblings, 2 replies; 16+ messages in thread From: Leah Rumancik @ 2021-04-07 15:42 UTC (permalink / raw) To: linux-ext4; +Cc: Leah Rumancik [1/2] ext4: wipe filename upon file deletion: - removed mount option for filename wipe, now filename wipe is default behavior - added wiping of file type at time of filename wipe [2/2] ext4: add ioctl FS_IOC_CHKPT_JRNL: - moved ioctl definition to include/uapi/linux/fs.h - renamed ioctl FS_IOC_CHKPT_JRNL - updated to require admin privileges to call ioctl - updated ioctl to take _u64 - updated jbd2_journal_flush to take flags argument to allow for discard flag (kernel: JBD2_FLAG_DO_DISCARD / userspace: CHKPT_JRNL_DO_DISCARD) Leah Rumancik (2): ext4: wipe filename upon file deletion ext4: add ioctl FS_IOC_CHKPT_JRNL fs/ext4/ext4.h | 1 + fs/ext4/inode.c | 4 +- fs/ext4/ioctl.c | 34 ++++++++++++-- fs/ext4/namei.c | 4 ++ fs/ext4/super.c | 6 +-- fs/jbd2/journal.c | 100 +++++++++++++++++++++++++++++++++++++++- fs/ocfs2/alloc.c | 2 +- fs/ocfs2/journal.c | 8 ++-- include/linux/jbd2.h | 5 +- include/uapi/linux/fs.h | 4 ++ 10 files changed, 152 insertions(+), 16 deletions(-) -- 2.31.0.208.g409f899ff0-goog ^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH v2 1/2] ext4: wipe filename upon file deletion 2021-04-07 15:42 [PATCH v2 0/2] Filename wipeout patch series updates Leah Rumancik @ 2021-04-07 15:42 ` Leah Rumancik 2021-04-07 21:33 ` Eric Biggers 2021-04-07 15:42 ` [PATCH v2 2/2] ext4: add ioctl FS_IOC_CHKPT_JRNL Leah Rumancik 1 sibling, 1 reply; 16+ messages in thread From: Leah Rumancik @ 2021-04-07 15:42 UTC (permalink / raw) To: linux-ext4; +Cc: Leah Rumancik Zero out filename and file type fields when file is deleted. Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com> --- fs/ext4/namei.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c index 883e2a7cd4ab..0147c86de99e 100644 --- a/fs/ext4/namei.c +++ b/fs/ext4/namei.c @@ -2492,6 +2492,10 @@ int ext4_generic_delete_entry(struct inode *dir, else de->inode = 0; inode_inc_iversion(dir); + + memset(de_del->name, 0, de_del->name_len); + memset(&de_del->file_type, 0, sizeof(__u8)); + return 0; } i += ext4_rec_len_from_disk(de->rec_len, blocksize); -- 2.31.0.208.g409f899ff0-goog ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH v2 1/2] ext4: wipe filename upon file deletion 2021-04-07 15:42 ` [PATCH v2 1/2] ext4: wipe filename upon file deletion Leah Rumancik @ 2021-04-07 21:33 ` Eric Biggers 2021-04-08 3:48 ` Theodore Ts'o 0 siblings, 1 reply; 16+ messages in thread From: Eric Biggers @ 2021-04-07 21:33 UTC (permalink / raw) To: Leah Rumancik; +Cc: linux-ext4 On Wed, Apr 07, 2021 at 03:42:01PM +0000, Leah Rumancik wrote: > Zero out filename and file type fields when file is deleted. Why? Also what about when a dir_entry is moved? Wouldn't you want to make sure that any copies don't get left around? > > Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com> > --- > fs/ext4/namei.c | 4 ++++ > 1 file changed, 4 insertions(+) > > diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c > index 883e2a7cd4ab..0147c86de99e 100644 > --- a/fs/ext4/namei.c > +++ b/fs/ext4/namei.c > @@ -2492,6 +2492,10 @@ int ext4_generic_delete_entry(struct inode *dir, > else > de->inode = 0; > inode_inc_iversion(dir); > + > + memset(de_del->name, 0, de_del->name_len); > + memset(&de_del->file_type, 0, sizeof(__u8)); The second line could be written as simply 'de_del->file_type = 0'. Also why zero these two fields in particular and not the whole dir_entry? - Eric ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 1/2] ext4: wipe filename upon file deletion 2021-04-07 21:33 ` Eric Biggers @ 2021-04-08 3:48 ` Theodore Ts'o 2021-04-08 5:21 ` Dave Chinner 0 siblings, 1 reply; 16+ messages in thread From: Theodore Ts'o @ 2021-04-08 3:48 UTC (permalink / raw) To: Eric Biggers; +Cc: Leah Rumancik, linux-ext4 On Wed, Apr 07, 2021 at 02:33:15PM -0700, Eric Biggers wrote: > On Wed, Apr 07, 2021 at 03:42:01PM +0000, Leah Rumancik wrote: > > Zero out filename and file type fields when file is deleted. > > Why? Eric is right that we need to have a better explanation in the commit description. In answer to Eric's question, the problem that is trying to be solved here is that if a customer happens to be storing PII in filenames (e-mail addresses, SSN's, etc.) that they might want to have a guarantee that if a file is deleted, the filename and the file's contents can be considered as *gone* after some wipeout time period has elapsed. So the use case is every N hours, some system daemon will execute FITRIM and FS_IOC_CHKPT_JRNL with the CHKPT_JRNL_DISCARD flag set, in order to meet this particular guarantee. Yes, we can't guarantee that discard will always zap all unused data blocks in the general case and SSD's may very well have copies made as a side effect of their GC operations, yadda, yadda. Fortunately, for this particular use case, the primary concern is for cloud customers running services on Google Cloud's Persistent Disks. > Also what about when a dir_entry is moved? Wouldn't you want to make sure that > any copies don't get left around? Yes, we need to also make sure that after we do a hash tree leaf node split in fs/ext4/namei.c:do_split(), that the empty space is also zero'ed out. > > @@ -2492,6 +2492,10 @@ int ext4_generic_delete_entry(struct inode *dir, > > else > > de->inode = 0; > > inode_inc_iversion(dir); > > + > > + memset(de_del->name, 0, de_del->name_len); > > + memset(&de_del->file_type, 0, sizeof(__u8)); > > The second line could be written as simply 'de_del->file_type = 0'. > > Also why zero these two fields in particular and not the whole dir_entry? Agreed, it would be a good diea to clear everything up to rec_len: memset(de_del->name, 0, de_del->rec_len - 12); and we should probably zero out de_del->name_len as well. Better to not leave anything behind. - Ted P.S. By the way, this is a guarantee that we're going to eventually want to care about for XFS as well, since as of COS-85 (Container-Optimized OS), XFS is supported in Preview Mode. This means that eventually we're going to want submit patches so as to be able to support the CHKPT_JRNL_DISCARD flag for FS_IOC_CHKPT_JRNL in XFS as well. P.P.S. We'll also want to have a mount option which supresses file names (for example, from ext4_error() messages) from showing up in kernel logs, to ease potential privacy concerns with respect to serial console and kernel logs. But that's for another patch set.... ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 1/2] ext4: wipe filename upon file deletion 2021-04-08 3:48 ` Theodore Ts'o @ 2021-04-08 5:21 ` Dave Chinner 2021-04-08 19:25 ` Theodore Ts'o 0 siblings, 1 reply; 16+ messages in thread From: Dave Chinner @ 2021-04-08 5:21 UTC (permalink / raw) To: Theodore Ts'o; +Cc: Eric Biggers, Leah Rumancik, linux-ext4 On Wed, Apr 07, 2021 at 11:48:40PM -0400, Theodore Ts'o wrote: > On Wed, Apr 07, 2021 at 02:33:15PM -0700, Eric Biggers wrote: > > On Wed, Apr 07, 2021 at 03:42:01PM +0000, Leah Rumancik wrote: > > > Zero out filename and file type fields when file is deleted. > > > > Why? > > Eric is right that we need to have a better explanation in the commit > description. > > In answer to Eric's question, the problem that is trying to be solved > here is that if a customer happens to be storing PII in filenames "if" Is this purely a hypothetical "if", or is it "we have a customer that actaully does this"? Because if this is just hypothetical, then future customers should already be advised and know not to store PII information in clear text *anywhere* in their systems. > (e-mail addresses, SSN's, etc.) that they might want to have a > guarantee that if a file is deleted, the filename and the file's > contents can be considered as *gone* after some wipeout time period > has elapsed. So the use case is every N hours, some system daemon > will execute FITRIM and FS_IOC_CHKPT_JRNL with the CHKPT_JRNL_DISCARD > flag set, in order to meet this particular guarantee. This seems like a better fit for FITRIM than anything else. Ooohh. We sure do suck at APIs, don't we? FITRIM has no flags field, so we can't extend that. But it still makes more sense to me to have something like: int fstrim(int fd, struct fstrim_range *r, int flags) syscall where the flags field can indicate that the journal should be trimmed. At that point, the "journal checkpoint and flush" is implied by the fact userspace is asking for the journal to be discarded.... > P.S. By the way, this is a guarantee that we're going to eventually > want to care about for XFS as well, since as of COS-85 > (Container-Optimized OS), XFS is supported in Preview Mode. This > means that eventually we're going to want submit patches so as to be > able to support the CHKPT_JRNL_DISCARD flag for FS_IOC_CHKPT_JRNL in > XFS as well. Oh, that won't be fun. XFS places a whiteout over the dirent to indicate that it has been freed, and it does not actually log anything other than the 4 byte whiteout at the start of the dirent and the 2 byte XFS_DIR2_DATA_FREE_TAG tag at the end of the dirent. So zeroing dirents is going to require changing the size and shape of dirent logging during unlinks... This will have to be done correclty for all the node merge, split and compaction cases, too, not just the "remove name" code. > P.P.S. We'll also want to have a mount option which supresses file > names (for example, from ext4_error() messages) from showing up in > kernel logs, to ease potential privacy concerns with respect to serial > console and kernel logs. But that's for another patch set.... This sounds more and more like "Don't encode PII in clear text anywhere" is a best practice that should be enforced with a big hammer. Filenames get everywhere and there's no easy way to prevent that because path lookups can be done by anyone in the kernel. This so much sounds like you're starting a game of whack-a-mole that can never be won. From a security perspective, this is just bad design. Storing PII in clear text filenames pretty much guarantees that the PII will leak because it can't be scrubbed/contained within application controlled boundaries. Trying to contain the spread of filenames within random kernel subsystems sounds like a fool's errand to me, especially given how few kernel developers will even know that filenames are considered sensitive information from a security perspective... Fundamentally, applications should *never* place PII in clear text in potentially leaky environments. The environment for storing PII should be designed to be secure and free of data leaks from the ground up. And ext4 has already got this with fscrypt support..... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 1/2] ext4: wipe filename upon file deletion 2021-04-08 5:21 ` Dave Chinner @ 2021-04-08 19:25 ` Theodore Ts'o 2021-04-09 0:02 ` Darrick J. Wong 0 siblings, 1 reply; 16+ messages in thread From: Theodore Ts'o @ 2021-04-08 19:25 UTC (permalink / raw) To: Dave Chinner; +Cc: Eric Biggers, Leah Rumancik, linux-ext4 On Thu, Apr 08, 2021 at 03:21:55PM +1000, Dave Chinner wrote: > "if" > > Is this purely a hypothetical "if", or is it "we have a customer > that actaully does this"? Because if this is just hypothetical, then > future customers should already be advised and know not to store PII > information in clear text *anywhere* in their systems. Customers might not even know if they are doing this. The concern they have is that when they are doing "lift and shift", and moving their workloads from their private data center to the cloud, they would have no idea what might be lurking in their legacy workloads. Heck, in a previous job, I visited a major customer who had lost their sources to a critical bit of their software and they were changing pathnames by running a hex editor on the binary. Enterprise customers do the darnedest things... (OTOH, they pay $$$ to our companies :-) In the ideal world, sure, all or most of them would agree that they *shouldn't* be storing any kind of PII at rest unencrypted, but they can't be sure, and so from the perspective of keeping their audit and I/T compliance committees happy, this requirement is desirable from a "belt and suspenders" perspective. > This seems like a better fit for FITRIM than anything else. > > Ooohh. We sure do suck at APIs, don't we? FITRIM has no flags field, > so we can't extend that. I don't have any serious objections to defining FITRIM2; OTOH, for Darrick's use case of wanting to make XFS work reliably with the GRUB bootloader (which doesn't replay file system journals) and a certain Enterprise Linux distribution (cough, cough) can't be bothered to do a clean shutdown at the end of doing an install, using a hypothetical FITRIM2 seems... wierd, whereas it might be cleaner to define semantics in terms of FS_IOC_CHKPT_JRNL. But I don't have a dog in that particular fight, since I'm not responsible of maintaining either XFS or RHEL. :-) Yet another possible solution might be to define a new system call, syncfs2(), which takes a flag option. That might be a bit more heavyweight, and we would still have to figure out how to define what a "journal checkpoint means" from the standpoint of an API definition. It would presumably be something like "allows a non-kernel implementation accessing the file system (e.g., from bootloaders like grub) to be able to access files on the file sytstem as easily as unmounting the file system", or perhaps defining it in terms of doing a FIFREEZE/FITHAW, without having to actually freeze the file system. > Oh, that won't be fun. XFS places a whiteout over the dirent to > indicate that it has been freed, and it does not actually log > anything other than the 4 byte whiteout at the start of the dirent > and the 2 byte XFS_DIR2_DATA_FREE_TAG tag at the end of the dirent. > So zeroing dirents is going to require changing the size and shape > of dirent logging during unlinks... So I'm not an expert on XFS, but XFS does logical logging, so what is in the log is "we're going to white out this dirent", right? So couldn't the replay code be taught to look at the dirent's reclen, and zero out the full directory entry at journal replay time? If the directory entry has already been reused, that's a case which the XFS replay code has to handle already. Or is there something subtle which makes this hard to do. > This will have to be done correclty for all the node merge, split > and compaction cases, too, not just the "remove name" code. Agreed this is going to be a lot more complicated for XFS. > > P.P.S. We'll also want to have a mount option which supresses file > > names (for example, from ext4_error() messages) from showing up in > > kernel logs, to ease potential privacy concerns with respect to serial > > console and kernel logs. But that's for another patch set.... > > This sounds more and more like "Don't encode PII in clear text > anywhere" is a best practice that should be enforced with a big > hammer. Filenames get everywhere and there's no easy way to prevent > that because path lookups can be done by anyone in the kernel. This > so much sounds like you're starting a game of whack-a-mole that can > never be won. > > From a security perspective, this is just bad design. Storing PII in > clear text filenames pretty much guarantees that the PII will leak > because it can't be scrubbed/contained within application controlled > boundaries. Trying to contain the spread of filenames within random > kernel subsystems sounds like a fool's errand to me, especially > given how few kernel developers will even know that filenames are > considered sensitive information from a security perspective... The problem is that the company that the Cloud SRE's work for is different from the enterprise customer owning the VM. If a customer stores PII in a filename, and the Cloud SRE places the log in some place where it could leak out, Google gets blamed, not the enterprise customer. So the Cloud SRE's *have* to treat the logs as if they might contain sensitive information, which means it can't be made available in bug trackers without a manual (human-drive) scrubbing to make sure the log doesn't have anything that might appear to contain PII. By the way, MySQL puts table names into file names, and even though table names normally aren't PII, it's still considered "customer data", and we need to treat any kind of customer data super-carefully. > Fundamentally, applications should *never* place PII in clear text > in potentially leaky environments. The environment for storing PII > should be designed to be secure and free of data leaks from the > ground up. And ext4 has already got this with fscrypt support..... Cloud disks (no matter which cloud vendor) do tend to be encrypted at rest, and in the case of Google, we even give customers the option to Bring Your Own Key, which means when the VM isn't running, no one at Google has access to the encryption key. But that doesn't change the fact that if we need to debug a system, the logs might contain file names, and file names might be customer-owned data. Using encryption at rest doesn't solve that problem. - Ted ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 1/2] ext4: wipe filename upon file deletion 2021-04-08 19:25 ` Theodore Ts'o @ 2021-04-09 0:02 ` Darrick J. Wong 2021-04-09 2:51 ` Theodore Ts'o 0 siblings, 1 reply; 16+ messages in thread From: Darrick J. Wong @ 2021-04-09 0:02 UTC (permalink / raw) To: Theodore Ts'o; +Cc: Dave Chinner, Eric Biggers, Leah Rumancik, linux-ext4 On Thu, Apr 08, 2021 at 03:25:30PM -0400, Theodore Ts'o wrote: > On Thu, Apr 08, 2021 at 03:21:55PM +1000, Dave Chinner wrote: > > "if" > > > > Is this purely a hypothetical "if", or is it "we have a customer > > that actaully does this"? Because if this is just hypothetical, then > > future customers should already be advised and know not to store PII > > information in clear text *anywhere* in their systems. > > Customers might not even know if they are doing this. The concern > they have is that when they are doing "lift and shift", and moving > their workloads from their private data center to the cloud, they > would have no idea what might be lurking in their legacy workloads. Customers don't know and don't care so long as they can get away with it. Two seconds after we grant them write access to the filesystem, they've written your covid19 vaccination card selfie^W^W^Wcredit card number to permanent storage. (Err I mean, they care after the breach goes public and gaslighting the victims^Wcustomers results in lawsuits that stick, so I guess maybe we should do the world a solid and try to prevent that...) > Heck, in a previous job, I visited a major customer who had lost their > sources to a critical bit of their software and they were changing > pathnames by running a hex editor on the binary. Enterprise customers > do the darnedest things... (OTOH, they pay $$$ to our companies :-) Heh. I never thought of myself as an enterprise customer. Editing binaries is fun. > In the ideal world, sure, all or most of them would agree that they > *shouldn't* be storing any kind of PII at rest unencrypted, but they > can't be sure, and so from the perspective of keeping their audit and > I/T compliance committees happy, this requirement is desirable from a > "belt and suspenders" perspective. > > > This seems like a better fit for FITRIM than anything else. > > > > Ooohh. We sure do suck at APIs, don't we? FITRIM has no flags field, > > so we can't extend that. > > I don't have any serious objections to defining FITRIM2; OTOH, for Er, are we talking about the directory name wiping, or the journal discarding? FITRIM is a horrible interface for directory name wiping. I'm assuming you're talking about the journal discard patch, but it's a little strange that these are all replies to the patch adding directory name wipes. > Darrick's use case of wanting to make XFS work reliably with the GRUB > bootloader (which doesn't replay file system journals) and a certain > Enterprise Linux distribution (cough, cough) can't be bothered to do a To be fair, the certain Enterprise Linux distro actually does /sbin/reboot properly like you'd expect; it's the clod users that (heh, I'll leave the typo in) who expect that they can save time by making their systems unrebootable. > clean shutdown at the end of doing an install, using a hypothetical > FITRIM2 seems... wierd, whereas it might be cleaner to define > semantics in terms of FS_IOC_CHKPT_JRNL. But I don't have a dog in > that particular fight, since I'm not responsible of maintaining either > XFS or RHEL. :-) > Yet another possible solution might be to define a new system call, > syncfs2(), which takes a flag option. That might be a bit more > heavyweight, and we would still have to figure out how to define what > a "journal checkpoint means" from the standpoint of an API definition. > It would presumably be something like "allows a non-kernel > implementation accessing the file system (e.g., from bootloaders like > grub) to be able to access files on the file sytstem as easily as > unmounting the file system", or perhaps defining it in terms of doing > a FIFREEZE/FITHAW, without having to actually freeze the file system. Let's move this part of the discussion over to the other patch, please. > > Oh, that won't be fun. XFS places a whiteout over the dirent to > > indicate that it has been freed, and it does not actually log > > anything other than the 4 byte whiteout at the start of the dirent > > and the 2 byte XFS_DIR2_DATA_FREE_TAG tag at the end of the dirent. > > So zeroing dirents is going to require changing the size and shape > > of dirent logging during unlinks... > > So I'm not an expert on XFS, but XFS does logical logging, so what is > in the log is "we're going to white out this dirent", right? So > couldn't the replay code be taught to look at the dirent's reclen, and > zero out the full directory entry at journal replay time? If the > directory entry has already been reused, that's a case which the XFS > replay code has to handle already. Or is there something subtle which > makes this hard to do. > > > This will have to be done correclty for all the node merge, split > > and compaction cases, too, not just the "remove name" code. > > Agreed this is going to be a lot more complicated for XFS. I didn't think it was any more difficult than changing xfs_removename to zero out the name and ftype fields at the same time it adds the whiteout to the dirent. But TBH I haven't thought through this too deeply. I /do/ think that if you ever want to add "secure" deletion to XFS, I'd want to do it by implementing FS_SECRM_FL for XFS, and not by adding more mount options. > > > P.P.S. We'll also want to have a mount option which supresses file > > > names (for example, from ext4_error() messages) from showing up in > > > kernel logs, to ease potential privacy concerns with respect to serial > > > console and kernel logs. But that's for another patch set.... > > > > This sounds more and more like "Don't encode PII in clear text > > anywhere" is a best practice that should be enforced with a big > > hammer. How? Encrypting the entire block device? Implementing fscrypt for all filesystems? Sending Daves to companies everywhere to heckle them for their poor sekuritee practices? ;) > > Filenames get everywhere and there's no easy way to prevent > > that because path lookups can be done by anyone in the kernel. This > > so much sounds like you're starting a game of whack-a-mole that can > > never be won. > > > > From a security perspective, this is just bad design. Storing PII in > > clear text filenames pretty much guarantees that the PII will leak > > because it can't be scrubbed/contained within application controlled > > boundaries. Trying to contain the spread of filenames within random > > kernel subsystems sounds like a fool's errand to me, especially > > given how few kernel developers will even know that filenames are > > considered sensitive information from a security perspective... > > The problem is that the company that the Cloud SRE's work for is > different from the enterprise customer owning the VM. If a customer > stores PII in a filename, and the Cloud SRE places the log in some > place where it could leak out, Google gets blamed, not the enterprise > customer. > > So the Cloud SRE's *have* to treat the logs as if they might contain > sensitive information, which means it can't be made available in bug > trackers without a manual (human-drive) scrubbing to make sure the log > doesn't have anything that might appear to contain PII. Question -- does e2image have the ability to obscure names like xfs_metadump will do if you don't pass it "-o" ? > By the way, MySQL puts table names into file names, and even though > table names normally aren't PII, it's still considered "customer > data", and we need to treat any kind of customer data super-carefully. <shrug> I don't have any objection to logging the directory inode number and the offset of the dirent. Replacing "Error in path /AUTOEXEC.BAT" with "Error in directory 0x1515 offset 23568" solves the PII-in-logfiles problem, right? And still leaves us enough of a breadcrumb that we can figure out where fstests went wrong. > > Fundamentally, applications should *never* place PII in clear text > > in potentially leaky environments. The environment for storing PII > > should be designed to be secure and free of data leaks from the > > ground up. And ext4 has already got this with fscrypt support..... > > Cloud disks (no matter which cloud vendor) do tend to be encrypted at > rest, and in the case of Google, we even give customers the option to > Bring Your Own Key, which means when the VM isn't running, no one at > Google has access to the encryption key. But that doesn't change the > fact that if we need to debug a system, the logs might contain file > names, and file names might be customer-owned data. Using encryption > at rest doesn't solve that problem. <nod> --D > > - Ted ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 1/2] ext4: wipe filename upon file deletion 2021-04-09 0:02 ` Darrick J. Wong @ 2021-04-09 2:51 ` Theodore Ts'o 2021-04-11 23:38 ` Dave Chinner 0 siblings, 1 reply; 16+ messages in thread From: Theodore Ts'o @ 2021-04-09 2:51 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Dave Chinner, Eric Biggers, Leah Rumancik, linux-ext4 On Thu, Apr 08, 2021 at 05:02:07PM -0700, Darrick J. Wong wrote: > > In the ideal world, sure, all or most of them would agree that they > > *shouldn't* be storing any kind of PII at rest unencrypted, but they > > can't be sure, and so from the perspective of keeping their audit and > > I/T compliance committees happy, this requirement is desirable from a > > "belt and suspenders" perspective. > > > > > This seems like a better fit for FITRIM than anything else. > > > > > > Ooohh. We sure do suck at APIs, don't we? FITRIM has no flags field, > > > so we can't extend that. > > > > I don't have any serious objections to defining FITRIM2; OTOH, for > > Er, are we talking about the directory name wiping, or the journal > discarding? Sorry, I was talking about journal wiping. The conflation is because the reason why we want to wipe the journal is because of the directory names in the journal, so the two are very much connected for our use case, but yes, directory names in directories is very from directory names in the journal. We don't actually need any kind of interface for wiping names in directories, since it doesn't cost us anything to unconditionally wipe the directory entries as opposed to just setting the inode number to zero. > I didn't think it was any more difficult than changing xfs_removename to > zero out the name and ftype fields at the same time it adds the whiteout > to the dirent. But TBH I haven't thought through this too deeply. > > I /do/ think that if you ever want to add "secure" deletion to XFS, I'd > want to do it by implementing FS_SECRM_FL for XFS, and not by adding > more mount options. The original meaning of FS_SECRM_FL was that the data blocks would be zero'ed --- when the inode was deleted. We don't intend to have a mount option for ext4 for zero'ing the directory entry, since it really doesn't cost us anything to memset the directory entry to zero at unlink time. I guess for a DAX file system, zero'ing the directory entry might cost a an extra cache line write, but for block-oriented devices, for us it's essentially cost-free --- so why add an extra mount option, and instead just zero the directory entry of everything other than rec_len? > Question -- does e2image have the ability to obscure names like > xfs_metadump will do if you don't pass it "-o" ? Yes, e2image has had the -s option to scramble file names since E2fsprogs 1.36 (February, 2005). - Ted ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 1/2] ext4: wipe filename upon file deletion 2021-04-09 2:51 ` Theodore Ts'o @ 2021-04-11 23:38 ` Dave Chinner 0 siblings, 0 replies; 16+ messages in thread From: Dave Chinner @ 2021-04-11 23:38 UTC (permalink / raw) To: Theodore Ts'o Cc: Darrick J. Wong, Eric Biggers, Leah Rumancik, linux-ext4 On Thu, Apr 08, 2021 at 10:51:09PM -0400, Theodore Ts'o wrote: > On Thu, Apr 08, 2021 at 05:02:07PM -0700, Darrick J. Wong wrote: > > > In the ideal world, sure, all or most of them would agree that they > > > *shouldn't* be storing any kind of PII at rest unencrypted, but they > > > can't be sure, and so from the perspective of keeping their audit and > > > I/T compliance committees happy, this requirement is desirable from a > > > "belt and suspenders" perspective. > > > > > > > This seems like a better fit for FITRIM than anything else. > > > > > > > > Ooohh. We sure do suck at APIs, don't we? FITRIM has no flags field, > > > > so we can't extend that. > > > > > > I don't have any serious objections to defining FITRIM2; OTOH, for > > > > Er, are we talking about the directory name wiping, or the journal > > discarding? > > Sorry, I was talking about journal wiping. The conflation is because > the reason why we want to wipe the journal is because of the directory > names in the journal, so the two are very much connected for our use > case, but yes, directory names in directories is very from directory > names in the journal. > > We don't actually need any kind of interface for wiping names in > directories, since it doesn't cost us anything to unconditionally wipe > the directory entries as opposed to just setting the inode number to > zero. > > > I didn't think it was any more difficult than changing xfs_removename to > > zero out the name and ftype fields at the same time it adds the whiteout > > to the dirent. But TBH I haven't thought through this too deeply. > > > > I /do/ think that if you ever want to add "secure" deletion to XFS, I'd > > want to do it by implementing FS_SECRM_FL for XFS, and not by adding > > more mount options. > > The original meaning of FS_SECRM_FL was that the data blocks would be > zero'ed --- when the inode was deleted. Sure, if discard is Good Enough(tm) for the journal, then we just treat this flag like "-o discard" was enabled for this file. Let the hardware do the "zeroing" in the background once we mark the extent as free. And if the hardware supports secure erase in place of discard, then even better. In the case of XFS, if we are to implement this directory entry zeroing then we actually need to discard the directory blocks. We may elide writeback of the directory block altogether if it is removed from the directory entirely between journal checkpoints. In that situation, we just write a whiteout for the block to the journal (we cancel the buffer) and we never actually write that buffer's contents to disk as it has been marked free by the journal commit. And, similarly short form directories aren't in blocks and can't be discarded, and we can elide inode writeback in the case where the inode clusters are freed. Hence zeroing dirents held inline in the inodes are not guaranteed to hit the disk, either. So we'd need to discard inode clusters as well. IOWs, we can do "rm -rf" of a directory with tens of thousands of entries, and the only thing that ends up hitting stable storage is a few hundred buffer invalidations in the journal. They remain unmodified in free space after the journal commit. This is why I said "good luck" to fixing XFS not to leak directory entries to disk. It's a pretty major undertaking to audit, fix and verify all the paths that remove directory entries to ensure that we do not leak dirent names anywhere. And I haven't even touched on PII in extended attributes :/ Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH v2 2/2] ext4: add ioctl FS_IOC_CHKPT_JRNL 2021-04-07 15:42 [PATCH v2 0/2] Filename wipeout patch series updates Leah Rumancik 2021-04-07 15:42 ` [PATCH v2 1/2] ext4: wipe filename upon file deletion Leah Rumancik @ 2021-04-07 15:42 ` Leah Rumancik 2021-04-07 18:35 ` Darrick J. Wong 1 sibling, 1 reply; 16+ messages in thread From: Leah Rumancik @ 2021-04-07 15:42 UTC (permalink / raw) To: linux-ext4; +Cc: Leah Rumancik ioctl FS_IOC_CHKPT_JRNL checkpoints and flushes the journal. With the CHKPT_JRNL_DISCARD flag set, the journal blocks are also discarded. With the filename wipeout patch, Ext4 guarantees that all data will be discarded on deletion. This ioctl allows for periodically discarding journal contents too. Also, add journal discard (if discard supported) during journal load after recovery. This provides a potential solution to https://lore.kernel.org/linux-ext4/YDZoaacIYStFQT8g@mit.edu/ for disks that support discard. After a successful journal recovery, e2fsck can call this ioctl to discard the journal as well. Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com> --- fs/ext4/ext4.h | 1 + fs/ext4/inode.c | 4 +- fs/ext4/ioctl.c | 34 ++++++++++++-- fs/ext4/super.c | 6 +-- fs/jbd2/journal.c | 100 +++++++++++++++++++++++++++++++++++++++- fs/ocfs2/alloc.c | 2 +- fs/ocfs2/journal.c | 8 ++-- include/linux/jbd2.h | 5 +- include/uapi/linux/fs.h | 4 ++ 9 files changed, 148 insertions(+), 16 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 826a56e3bbd2..e76e9961e992 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -724,6 +724,7 @@ enum { #define EXT4_IOC_CLEAR_ES_CACHE _IO('f', 40) #define EXT4_IOC_GETSTATE _IOW('f', 41, __u32) #define EXT4_IOC_GET_ES_CACHE _IOWR('f', 42, struct fiemap) +/* ioctl code 43 reserved for FS_IOC_JRNL_CHKPT */ #define EXT4_IOC_SHUTDOWN _IOR ('X', 125, __u32) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 0948a43f1b3d..715077e30c58 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -3225,7 +3225,7 @@ static sector_t ext4_bmap(struct address_space *mapping, sector_t block) ext4_clear_inode_state(inode, EXT4_STATE_JDATA); journal = EXT4_JOURNAL(inode); jbd2_journal_lock_updates(journal); - err = jbd2_journal_flush(journal); + err = jbd2_journal_flush(journal, 0); jbd2_journal_unlock_updates(journal); if (err) @@ -6007,7 +6007,7 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val) if (val) ext4_set_inode_flag(inode, EXT4_INODE_JOURNAL_DATA); else { - err = jbd2_journal_flush(journal); + err = jbd2_journal_flush(journal, 0); if (err < 0) { jbd2_journal_unlock_updates(journal); percpu_up_write(&sbi->s_writepages_rwsem); diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c index a2cf35066f46..ca64c680ef6d 100644 --- a/fs/ext4/ioctl.c +++ b/fs/ext4/ioctl.c @@ -756,7 +756,7 @@ static long ext4_ioctl_group_add(struct file *file, err = ext4_group_add(sb, input); if (EXT4_SB(sb)->s_journal) { jbd2_journal_lock_updates(EXT4_SB(sb)->s_journal); - err2 = jbd2_journal_flush(EXT4_SB(sb)->s_journal); + err2 = jbd2_journal_flush(EXT4_SB(sb)->s_journal, 0); jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal); } if (err == 0) @@ -939,7 +939,7 @@ static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) err = ext4_group_extend(sb, EXT4_SB(sb)->s_es, n_blocks_count); if (EXT4_SB(sb)->s_journal) { jbd2_journal_lock_updates(EXT4_SB(sb)->s_journal); - err2 = jbd2_journal_flush(EXT4_SB(sb)->s_journal); + err2 = jbd2_journal_flush(EXT4_SB(sb)->s_journal, 0); jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal); } if (err == 0) @@ -1082,7 +1082,7 @@ static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) if (EXT4_SB(sb)->s_journal) { ext4_fc_mark_ineligible(sb, EXT4_FC_REASON_RESIZE); jbd2_journal_lock_updates(EXT4_SB(sb)->s_journal); - err2 = jbd2_journal_flush(EXT4_SB(sb)->s_journal); + err2 = jbd2_journal_flush(EXT4_SB(sb)->s_journal, 0); jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal); } if (err == 0) @@ -1318,6 +1318,33 @@ static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) return -EOPNOTSUPP; return fsverity_ioctl_read_metadata(filp, (const void __user *)arg); + case FS_IOC_CHKPT_JRNL: + { + int err = 0; + unsigned long long flags = 0; + + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + + /* file argument is not the mount point */ + if (file_dentry(filp) != sb->s_root) + return -EINVAL; + + /* filesystem is not backed by block device */ + if (sb->s_bdev == NULL) + return -EINVAL; + + if (copy_from_user(&flags, (__u64 __user *)arg, + sizeof(__u64))) + return -EFAULT; + + if (EXT4_SB(sb)->s_journal) { + jbd2_journal_lock_updates(EXT4_SB(sb)->s_journal); + err = jbd2_journal_flush(EXT4_SB(sb)->s_journal, flags); + jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal); + } + return err; + } default: return -ENOTTY; @@ -1407,6 +1434,7 @@ long ext4_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg) case EXT4_IOC_GET_ES_CACHE: case FS_IOC_FSGETXATTR: case FS_IOC_FSSETXATTR: + case FS_IOC_CHKPT_JRNL: break; default: return -ENOIOCTLCMD; diff --git a/fs/ext4/super.c b/fs/ext4/super.c index b9693680463a..1b3a9eb58b63 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -5613,7 +5613,7 @@ static int ext4_mark_recovery_complete(struct super_block *sb, return 0; } jbd2_journal_lock_updates(journal); - err = jbd2_journal_flush(journal); + err = jbd2_journal_flush(journal, 0); if (err < 0) goto out; @@ -5755,7 +5755,7 @@ static int ext4_freeze(struct super_block *sb) * Don't clear the needs_recovery flag if we failed to * flush the journal. */ - error = jbd2_journal_flush(journal); + error = jbd2_journal_flush(journal, 0); if (error < 0) goto out; @@ -6349,7 +6349,7 @@ static int ext4_quota_on(struct super_block *sb, int type, int format_id, * otherwise be livelocked... */ jbd2_journal_lock_updates(EXT4_SB(sb)->s_journal); - err = jbd2_journal_flush(EXT4_SB(sb)->s_journal); + err = jbd2_journal_flush(EXT4_SB(sb)->s_journal, 0); jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal); if (err) return err; diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c index 2dc944442802..6bb5980ac789 100644 --- a/fs/jbd2/journal.c +++ b/fs/jbd2/journal.c @@ -1686,6 +1686,90 @@ static void jbd2_mark_journal_empty(journal_t *journal, int write_op) write_unlock(&journal->j_state_lock); } +/* discard journal blocks excluding journal superblock */ +static int __jbd2_journal_issue_discard(journal_t *journal) +{ + int err = 0; + unsigned long block, log_offset; /* logical */ + unsigned long long phys_block, block_start, block_stop; /* physical */ + loff_t byte_start, byte_stop, byte_count; + struct request_queue *q = bdev_get_queue(journal->j_dev); + + if (!q) + return -ENXIO; + + if (!blk_queue_discard(q)) + return -EOPNOTSUPP; + + /* lookup block mapping and issue discard for each contiguous region */ + log_offset = be32_to_cpu(journal->j_superblock->s_first); + + err = jbd2_journal_bmap(journal, log_offset, &block_start); + if (err) { + printk(KERN_ERR "JBD2: bad block at offset %lu", log_offset); + return err; + } + + /* + * use block_start - 1 to meet check for contiguous with previous region: + * phys_block == block_stop + 1 + */ + block_stop = block_start - 1; + + for (block = log_offset; block < journal->j_total_len; block++) { + err = jbd2_journal_bmap(journal, block, &phys_block); + if (err) { + printk(KERN_ERR "JBD2: bad block at offset %lu", block); + return err; + } + + /* + * if block is last block, update stopping point + * if not last block and + * block is contiguous with previous block, continue + */ + if (block == journal->j_total_len - 1) + block_stop = phys_block; + else if (phys_block == block_stop + 1) { + block_stop++; + continue; + } + + /* + * if not contiguous with prior physical block or this is last + * block of journal, take care of the region + */ + byte_start = block_start * journal->j_blocksize; + byte_stop = block_stop * journal->j_blocksize; + byte_count = (block_stop - block_start + 1) * + journal->j_blocksize; + + truncate_inode_pages_range(journal->j_dev->bd_inode->i_mapping, + byte_start, byte_stop); + + /* + * use blkdev_issue_discard instead of sb_issue_discard + * because superblock not yet populated when this is + * called during journal_load during mount process + */ + err = blkdev_issue_discard(journal->j_dev, + byte_start >> SECTOR_SHIFT, + byte_count >> SECTOR_SHIFT, + GFP_NOFS, 0); + + if (unlikely(err != 0)) { + printk(KERN_ERR "JBD2: unable to discard " + "journal at physical blocks %llu - %llu", + block_start, block_stop); + return err; + } + + block_start = phys_block; + block_stop = phys_block; + } + + return blkdev_issue_flush(journal->j_dev); +} /** * jbd2_journal_update_sb_errno() - Update error in the journal. @@ -1936,6 +2020,10 @@ int jbd2_journal_load(journal_t *journal) */ journal->j_flags &= ~JBD2_ABORT; + /* if journal device supports discard, discard journal blocks */ + if (__jbd2_journal_issue_discard(journal)) + printk(KERN_WARNING "JBD2: failed to discard journal when loading"); + /* OK, we've finished with the dynamic journal bits: * reinitialise the dynamic contents of the superblock in memory * and reset them on disk. */ @@ -2246,13 +2334,17 @@ EXPORT_SYMBOL(jbd2_journal_clear_features); /** * jbd2_journal_flush() - Flush journal * @journal: Journal to act on. + * @flags: options (see below) * * Flush all data for a given journal to disk and empty the journal. * Filesystems can use this when remounting readonly to ensure that * recovery does not need to happen on remount. + * + * flags: + * JBD2_FLAG_DO_DISCARD: also discard the journal blocks; if discard is not + * supported on the device, returns err */ - -int jbd2_journal_flush(journal_t *journal) +int jbd2_journal_flush(journal_t *journal, unsigned long long flags) { int err = 0; transaction_t *transaction = NULL; @@ -2306,6 +2398,10 @@ int jbd2_journal_flush(journal_t *journal) * commits of data to the journal will restore the current * s_start value. */ jbd2_mark_journal_empty(journal, REQ_SYNC | REQ_FUA); + + if (flags & JBD2_FLAG_DO_DISCARD) + err = __jbd2_journal_issue_discard(journal); + mutex_unlock(&journal->j_checkpoint_mutex); write_lock(&journal->j_state_lock); J_ASSERT(!journal->j_running_transaction); diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c index 78710788c237..1b41bf9f4a7e 100644 --- a/fs/ocfs2/alloc.c +++ b/fs/ocfs2/alloc.c @@ -6020,7 +6020,7 @@ int __ocfs2_flush_truncate_log(struct ocfs2_super *osb) * Then truncate log will be replayed resulting in cluster double free. */ jbd2_journal_lock_updates(journal->j_journal); - status = jbd2_journal_flush(journal->j_journal); + status = jbd2_journal_flush(journal->j_journal, 0); jbd2_journal_unlock_updates(journal->j_journal); if (status < 0) { mlog_errno(status); diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c index db52e843002a..a1438548747e 100644 --- a/fs/ocfs2/journal.c +++ b/fs/ocfs2/journal.c @@ -310,7 +310,7 @@ static int ocfs2_commit_cache(struct ocfs2_super *osb) } jbd2_journal_lock_updates(journal->j_journal); - status = jbd2_journal_flush(journal->j_journal); + status = jbd2_journal_flush(journal->j_journal, 0); jbd2_journal_unlock_updates(journal->j_journal); if (status < 0) { up_write(&journal->j_trans_barrier); @@ -1002,7 +1002,7 @@ void ocfs2_journal_shutdown(struct ocfs2_super *osb) if (ocfs2_mount_local(osb)) { jbd2_journal_lock_updates(journal->j_journal); - status = jbd2_journal_flush(journal->j_journal); + status = jbd2_journal_flush(journal->j_journal, 0); jbd2_journal_unlock_updates(journal->j_journal); if (status < 0) mlog_errno(status); @@ -1072,7 +1072,7 @@ int ocfs2_journal_load(struct ocfs2_journal *journal, int local, int replayed) if (replayed) { jbd2_journal_lock_updates(journal->j_journal); - status = jbd2_journal_flush(journal->j_journal); + status = jbd2_journal_flush(journal->j_journal, 0); jbd2_journal_unlock_updates(journal->j_journal); if (status < 0) mlog_errno(status); @@ -1668,7 +1668,7 @@ static int ocfs2_replay_journal(struct ocfs2_super *osb, /* wipe the journal */ jbd2_journal_lock_updates(journal); - status = jbd2_journal_flush(journal); + status = jbd2_journal_flush(journal, 0); jbd2_journal_unlock_updates(journal); if (status < 0) mlog_errno(status); diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h index 99d3cd051ac3..50510473283e 100644 --- a/include/linux/jbd2.h +++ b/include/linux/jbd2.h @@ -307,6 +307,9 @@ typedef struct journal_superblock_s JBD2_FEATURE_INCOMPAT_CSUM_V3 | \ JBD2_FEATURE_INCOMPAT_FAST_COMMIT) +/* discard journal blocks flag for jbd2_journal_flush */ +#define JBD2_FLAG_DO_DISCARD 1 + #ifdef __KERNEL__ #include <linux/fs.h> @@ -1491,7 +1494,7 @@ extern int jbd2_journal_invalidatepage(journal_t *, struct page *, unsigned int, unsigned int); extern int jbd2_journal_try_to_free_buffers(journal_t *journal, struct page *page); extern int jbd2_journal_stop(handle_t *); -extern int jbd2_journal_flush (journal_t *); +extern int jbd2_journal_flush(journal_t *journal, unsigned long long flags); extern void jbd2_journal_lock_updates (journal_t *); extern void jbd2_journal_unlock_updates (journal_t *); diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h index f44eb0a04afd..d66408c38c1d 100644 --- a/include/uapi/linux/fs.h +++ b/include/uapi/linux/fs.h @@ -214,6 +214,7 @@ struct fsxattr { #define FS_IOC_FSSETXATTR _IOW('X', 32, struct fsxattr) #define FS_IOC_GETFSLABEL _IOR(0x94, 49, char[FSLABEL_MAX]) #define FS_IOC_SETFSLABEL _IOW(0x94, 50, char[FSLABEL_MAX]) +#define FS_IOC_CHKPT_JRNL _IOW('f', 43, __u64) /* * Inode flags (FS_IOC_GETFLAGS / FS_IOC_SETFLAGS) @@ -304,4 +305,7 @@ typedef int __bitwise __kernel_rwf_t; #define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\ RWF_APPEND) +/* flag for ioctl FS_IOC_JRNL_CHKPT */ +#define CHKPT_JRNL_DO_DISCARD 1 + #endif /* _UAPI_LINUX_FS_H */ -- 2.31.0.208.g409f899ff0-goog ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH v2 2/2] ext4: add ioctl FS_IOC_CHKPT_JRNL 2021-04-07 15:42 ` [PATCH v2 2/2] ext4: add ioctl FS_IOC_CHKPT_JRNL Leah Rumancik @ 2021-04-07 18:35 ` Darrick J. Wong 2021-04-07 21:15 ` Dave Chinner 0 siblings, 1 reply; 16+ messages in thread From: Darrick J. Wong @ 2021-04-07 18:35 UTC (permalink / raw) To: Leah Rumancik; +Cc: linux-ext4 On Wed, Apr 07, 2021 at 03:42:02PM +0000, Leah Rumancik wrote: > ioctl FS_IOC_CHKPT_JRNL checkpoints and flushes the journal. With the > CHKPT_JRNL_DISCARD flag set, the journal blocks are also discarded. > With the filename wipeout patch, Ext4 guarantees that all data will be > discarded on deletion. This ioctl allows for periodically discarding > journal contents too. This needs a documentation update to cover what this new userspace ABI does, and probably linux-api and linux-fsdevel should be cc'd. > Also, add journal discard (if discard supported) during journal load > after recovery. This provides a potential solution to > https://lore.kernel.org/linux-ext4/YDZoaacIYStFQT8g@mit.edu/ for > disks that support discard. After a successful journal recovery, e2fsck > can call this ioctl to discard the journal as well. > > Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com> > --- > fs/ext4/ext4.h | 1 + > fs/ext4/inode.c | 4 +- > fs/ext4/ioctl.c | 34 ++++++++++++-- > fs/ext4/super.c | 6 +-- > fs/jbd2/journal.c | 100 +++++++++++++++++++++++++++++++++++++++- > fs/ocfs2/alloc.c | 2 +- > fs/ocfs2/journal.c | 8 ++-- > include/linux/jbd2.h | 5 +- > include/uapi/linux/fs.h | 4 ++ > 9 files changed, 148 insertions(+), 16 deletions(-) > > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h > index 826a56e3bbd2..e76e9961e992 100644 > --- a/fs/ext4/ext4.h > +++ b/fs/ext4/ext4.h > @@ -724,6 +724,7 @@ enum { > #define EXT4_IOC_CLEAR_ES_CACHE _IO('f', 40) > #define EXT4_IOC_GETSTATE _IOW('f', 41, __u32) > #define EXT4_IOC_GET_ES_CACHE _IOWR('f', 42, struct fiemap) > +/* ioctl code 43 reserved for FS_IOC_JRNL_CHKPT */ Pat Sajak hates it when people won't buy vowels. ;) FS_IOC_CHECKPOINT, maybe? > > #define EXT4_IOC_SHUTDOWN _IOR ('X', 125, __u32) > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index 0948a43f1b3d..715077e30c58 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -3225,7 +3225,7 @@ static sector_t ext4_bmap(struct address_space *mapping, sector_t block) > ext4_clear_inode_state(inode, EXT4_STATE_JDATA); > journal = EXT4_JOURNAL(inode); > jbd2_journal_lock_updates(journal); > - err = jbd2_journal_flush(journal); > + err = jbd2_journal_flush(journal, 0); > jbd2_journal_unlock_updates(journal); > > if (err) > @@ -6007,7 +6007,7 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val) > if (val) > ext4_set_inode_flag(inode, EXT4_INODE_JOURNAL_DATA); > else { > - err = jbd2_journal_flush(journal); > + err = jbd2_journal_flush(journal, 0); > if (err < 0) { > jbd2_journal_unlock_updates(journal); > percpu_up_write(&sbi->s_writepages_rwsem); > diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c > index a2cf35066f46..ca64c680ef6d 100644 > --- a/fs/ext4/ioctl.c > +++ b/fs/ext4/ioctl.c > @@ -756,7 +756,7 @@ static long ext4_ioctl_group_add(struct file *file, > err = ext4_group_add(sb, input); > if (EXT4_SB(sb)->s_journal) { > jbd2_journal_lock_updates(EXT4_SB(sb)->s_journal); > - err2 = jbd2_journal_flush(EXT4_SB(sb)->s_journal); > + err2 = jbd2_journal_flush(EXT4_SB(sb)->s_journal, 0); > jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal); > } > if (err == 0) > @@ -939,7 +939,7 @@ static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) > err = ext4_group_extend(sb, EXT4_SB(sb)->s_es, n_blocks_count); > if (EXT4_SB(sb)->s_journal) { > jbd2_journal_lock_updates(EXT4_SB(sb)->s_journal); > - err2 = jbd2_journal_flush(EXT4_SB(sb)->s_journal); > + err2 = jbd2_journal_flush(EXT4_SB(sb)->s_journal, 0); > jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal); > } > if (err == 0) > @@ -1082,7 +1082,7 @@ static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) > if (EXT4_SB(sb)->s_journal) { > ext4_fc_mark_ineligible(sb, EXT4_FC_REASON_RESIZE); > jbd2_journal_lock_updates(EXT4_SB(sb)->s_journal); > - err2 = jbd2_journal_flush(EXT4_SB(sb)->s_journal); > + err2 = jbd2_journal_flush(EXT4_SB(sb)->s_journal, 0); > jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal); > } > if (err == 0) > @@ -1318,6 +1318,33 @@ static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) > return -EOPNOTSUPP; > return fsverity_ioctl_read_metadata(filp, > (const void __user *)arg); > + case FS_IOC_CHKPT_JRNL: > + { Should this whole block be a separate static function? > + int err = 0; > + unsigned long long flags = 0; > + > + if (!capable(CAP_SYS_ADMIN)) > + return -EPERM; > + > + /* file argument is not the mount point */ > + if (file_dentry(filp) != sb->s_root) > + return -EINVAL; > + > + /* filesystem is not backed by block device */ > + if (sb->s_bdev == NULL) > + return -EINVAL; > + > + if (copy_from_user(&flags, (__u64 __user *)arg, > + sizeof(__u64))) > + return -EFAULT; Needs a check here to return -EINVAL if any flags bits are set other than CHKPT_JRNL_DO_DISCARD, particularly since you're feeding flags straight into jbd2 code. FWIW I would also separate the changes into two patches -- one to add the flags paramter to jbd2_journal_flush, and a second patch to define the new ioctl and wire it up. > + > + if (EXT4_SB(sb)->s_journal) { > + jbd2_journal_lock_updates(EXT4_SB(sb)->s_journal); > + err = jbd2_journal_flush(EXT4_SB(sb)->s_journal, flags); > + jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal); > + } > + return err; > + } > > default: > return -ENOTTY; > @@ -1407,6 +1434,7 @@ long ext4_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg) > case EXT4_IOC_GET_ES_CACHE: > case FS_IOC_FSGETXATTR: > case FS_IOC_FSSETXATTR: > + case FS_IOC_CHKPT_JRNL: > break; > default: > return -ENOIOCTLCMD; > diff --git a/fs/ext4/super.c b/fs/ext4/super.c > index b9693680463a..1b3a9eb58b63 100644 > --- a/fs/ext4/super.c > +++ b/fs/ext4/super.c > @@ -5613,7 +5613,7 @@ static int ext4_mark_recovery_complete(struct super_block *sb, > return 0; > } > jbd2_journal_lock_updates(journal); > - err = jbd2_journal_flush(journal); > + err = jbd2_journal_flush(journal, 0); > if (err < 0) > goto out; > > @@ -5755,7 +5755,7 @@ static int ext4_freeze(struct super_block *sb) > * Don't clear the needs_recovery flag if we failed to > * flush the journal. > */ > - error = jbd2_journal_flush(journal); > + error = jbd2_journal_flush(journal, 0); > if (error < 0) > goto out; > > @@ -6349,7 +6349,7 @@ static int ext4_quota_on(struct super_block *sb, int type, int format_id, > * otherwise be livelocked... > */ > jbd2_journal_lock_updates(EXT4_SB(sb)->s_journal); > - err = jbd2_journal_flush(EXT4_SB(sb)->s_journal); > + err = jbd2_journal_flush(EXT4_SB(sb)->s_journal, 0); > jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal); > if (err) > return err; > diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c > index 2dc944442802..6bb5980ac789 100644 > --- a/fs/jbd2/journal.c > +++ b/fs/jbd2/journal.c > @@ -1686,6 +1686,90 @@ static void jbd2_mark_journal_empty(journal_t *journal, int write_op) > write_unlock(&journal->j_state_lock); > } > > +/* discard journal blocks excluding journal superblock */ > +static int __jbd2_journal_issue_discard(journal_t *journal) > +{ > + int err = 0; > + unsigned long block, log_offset; /* logical */ > + unsigned long long phys_block, block_start, block_stop; /* physical */ > + loff_t byte_start, byte_stop, byte_count; > + struct request_queue *q = bdev_get_queue(journal->j_dev); > + > + if (!q) > + return -ENXIO; > + > + if (!blk_queue_discard(q)) > + return -EOPNOTSUPP; > + > + /* lookup block mapping and issue discard for each contiguous region */ > + log_offset = be32_to_cpu(journal->j_superblock->s_first); > + > + err = jbd2_journal_bmap(journal, log_offset, &block_start); > + if (err) { > + printk(KERN_ERR "JBD2: bad block at offset %lu", log_offset); > + return err; > + } > + > + /* > + * use block_start - 1 to meet check for contiguous with previous region: > + * phys_block == block_stop + 1 > + */ > + block_stop = block_start - 1; > + > + for (block = log_offset; block < journal->j_total_len; block++) { > + err = jbd2_journal_bmap(journal, block, &phys_block); > + if (err) { > + printk(KERN_ERR "JBD2: bad block at offset %lu", block); > + return err; > + } > + > + /* > + * if block is last block, update stopping point > + * if not last block and > + * block is contiguous with previous block, continue > + */ > + if (block == journal->j_total_len - 1) > + block_stop = phys_block; > + else if (phys_block == block_stop + 1) { > + block_stop++; > + continue; > + } > + > + /* > + * if not contiguous with prior physical block or this is last > + * block of journal, take care of the region > + */ > + byte_start = block_start * journal->j_blocksize; > + byte_stop = block_stop * journal->j_blocksize; > + byte_count = (block_stop - block_start + 1) * > + journal->j_blocksize; > + > + truncate_inode_pages_range(journal->j_dev->bd_inode->i_mapping, > + byte_start, byte_stop); > + > + /* > + * use blkdev_issue_discard instead of sb_issue_discard > + * because superblock not yet populated when this is > + * called during journal_load during mount process > + */ > + err = blkdev_issue_discard(journal->j_dev, > + byte_start >> SECTOR_SHIFT, > + byte_count >> SECTOR_SHIFT, > + GFP_NOFS, 0); > + > + if (unlikely(err != 0)) { > + printk(KERN_ERR "JBD2: unable to discard " > + "journal at physical blocks %llu - %llu", > + block_start, block_stop); > + return err; > + } > + > + block_start = phys_block; > + block_stop = phys_block; > + } > + > + return blkdev_issue_flush(journal->j_dev); > +} > > /** > * jbd2_journal_update_sb_errno() - Update error in the journal. > @@ -1936,6 +2020,10 @@ int jbd2_journal_load(journal_t *journal) > */ > journal->j_flags &= ~JBD2_ABORT; > > + /* if journal device supports discard, discard journal blocks */ > + if (__jbd2_journal_issue_discard(journal)) > + printk(KERN_WARNING "JBD2: failed to discard journal when loading"); How badly do you need to log this? The journal still works, right? This is going to trigger on all the old spinning disks that don't support discard. > + > /* OK, we've finished with the dynamic journal bits: > * reinitialise the dynamic contents of the superblock in memory > * and reset them on disk. */ > @@ -2246,13 +2334,17 @@ EXPORT_SYMBOL(jbd2_journal_clear_features); > /** > * jbd2_journal_flush() - Flush journal > * @journal: Journal to act on. > + * @flags: options (see below) > * > * Flush all data for a given journal to disk and empty the journal. > * Filesystems can use this when remounting readonly to ensure that > * recovery does not need to happen on remount. > + * > + * flags: > + * JBD2_FLAG_DO_DISCARD: also discard the journal blocks; if discard is not > + * supported on the device, returns err > */ > - > -int jbd2_journal_flush(journal_t *journal) > +int jbd2_journal_flush(journal_t *journal, unsigned long long flags) > { > int err = 0; > transaction_t *transaction = NULL; > @@ -2306,6 +2398,10 @@ int jbd2_journal_flush(journal_t *journal) > * commits of data to the journal will restore the current > * s_start value. */ > jbd2_mark_journal_empty(journal, REQ_SYNC | REQ_FUA); > + > + if (flags & JBD2_FLAG_DO_DISCARD) > + err = __jbd2_journal_issue_discard(journal); > + > mutex_unlock(&journal->j_checkpoint_mutex); > write_lock(&journal->j_state_lock); > J_ASSERT(!journal->j_running_transaction); > diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c > index 78710788c237..1b41bf9f4a7e 100644 > --- a/fs/ocfs2/alloc.c > +++ b/fs/ocfs2/alloc.c > @@ -6020,7 +6020,7 @@ int __ocfs2_flush_truncate_log(struct ocfs2_super *osb) > * Then truncate log will be replayed resulting in cluster double free. > */ > jbd2_journal_lock_updates(journal->j_journal); > - status = jbd2_journal_flush(journal->j_journal); > + status = jbd2_journal_flush(journal->j_journal, 0); > jbd2_journal_unlock_updates(journal->j_journal); > if (status < 0) { > mlog_errno(status); > diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c > index db52e843002a..a1438548747e 100644 > --- a/fs/ocfs2/journal.c > +++ b/fs/ocfs2/journal.c > @@ -310,7 +310,7 @@ static int ocfs2_commit_cache(struct ocfs2_super *osb) > } > > jbd2_journal_lock_updates(journal->j_journal); > - status = jbd2_journal_flush(journal->j_journal); > + status = jbd2_journal_flush(journal->j_journal, 0); > jbd2_journal_unlock_updates(journal->j_journal); > if (status < 0) { > up_write(&journal->j_trans_barrier); > @@ -1002,7 +1002,7 @@ void ocfs2_journal_shutdown(struct ocfs2_super *osb) > > if (ocfs2_mount_local(osb)) { > jbd2_journal_lock_updates(journal->j_journal); > - status = jbd2_journal_flush(journal->j_journal); > + status = jbd2_journal_flush(journal->j_journal, 0); > jbd2_journal_unlock_updates(journal->j_journal); > if (status < 0) > mlog_errno(status); > @@ -1072,7 +1072,7 @@ int ocfs2_journal_load(struct ocfs2_journal *journal, int local, int replayed) > > if (replayed) { > jbd2_journal_lock_updates(journal->j_journal); > - status = jbd2_journal_flush(journal->j_journal); > + status = jbd2_journal_flush(journal->j_journal, 0); > jbd2_journal_unlock_updates(journal->j_journal); > if (status < 0) > mlog_errno(status); > @@ -1668,7 +1668,7 @@ static int ocfs2_replay_journal(struct ocfs2_super *osb, > > /* wipe the journal */ > jbd2_journal_lock_updates(journal); > - status = jbd2_journal_flush(journal); > + status = jbd2_journal_flush(journal, 0); > jbd2_journal_unlock_updates(journal); > if (status < 0) > mlog_errno(status); > diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h > index 99d3cd051ac3..50510473283e 100644 > --- a/include/linux/jbd2.h > +++ b/include/linux/jbd2.h > @@ -307,6 +307,9 @@ typedef struct journal_superblock_s > JBD2_FEATURE_INCOMPAT_CSUM_V3 | \ > JBD2_FEATURE_INCOMPAT_FAST_COMMIT) > > +/* discard journal blocks flag for jbd2_journal_flush */ > +#define JBD2_FLAG_DO_DISCARD 1 > + > #ifdef __KERNEL__ > > #include <linux/fs.h> > @@ -1491,7 +1494,7 @@ extern int jbd2_journal_invalidatepage(journal_t *, > struct page *, unsigned int, unsigned int); > extern int jbd2_journal_try_to_free_buffers(journal_t *journal, struct page *page); > extern int jbd2_journal_stop(handle_t *); > -extern int jbd2_journal_flush (journal_t *); > +extern int jbd2_journal_flush(journal_t *journal, unsigned long long flags); > extern void jbd2_journal_lock_updates (journal_t *); > extern void jbd2_journal_unlock_updates (journal_t *); > > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h > index f44eb0a04afd..d66408c38c1d 100644 > --- a/include/uapi/linux/fs.h > +++ b/include/uapi/linux/fs.h > @@ -214,6 +214,7 @@ struct fsxattr { > #define FS_IOC_FSSETXATTR _IOW('X', 32, struct fsxattr) > #define FS_IOC_GETFSLABEL _IOR(0x94, 49, char[FSLABEL_MAX]) > #define FS_IOC_SETFSLABEL _IOW(0x94, 50, char[FSLABEL_MAX]) > +#define FS_IOC_CHKPT_JRNL _IOW('f', 43, __u64) > > /* > * Inode flags (FS_IOC_GETFLAGS / FS_IOC_SETFLAGS) > @@ -304,4 +305,7 @@ typedef int __bitwise __kernel_rwf_t; > #define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\ > RWF_APPEND) > > +/* flag for ioctl FS_IOC_JRNL_CHKPT */ > +#define CHKPT_JRNL_DO_DISCARD 1 FS_IOC_CHECKPOINT_FLAG_DISCARD, to go with FS_IOC_CHECKPOINT? --D > + > #endif /* _UAPI_LINUX_FS_H */ > -- > 2.31.0.208.g409f899ff0-goog > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 2/2] ext4: add ioctl FS_IOC_CHKPT_JRNL 2021-04-07 18:35 ` Darrick J. Wong @ 2021-04-07 21:15 ` Dave Chinner 2021-04-08 1:26 ` Darrick J. Wong 0 siblings, 1 reply; 16+ messages in thread From: Dave Chinner @ 2021-04-07 21:15 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Leah Rumancik, linux-ext4 On Wed, Apr 07, 2021 at 11:35:47AM -0700, Darrick J. Wong wrote: > On Wed, Apr 07, 2021 at 03:42:02PM +0000, Leah Rumancik wrote: > > ioctl FS_IOC_CHKPT_JRNL checkpoints and flushes the journal. With the > > CHKPT_JRNL_DISCARD flag set, the journal blocks are also discarded. > > With the filename wipeout patch, Ext4 guarantees that all data will be > > discarded on deletion. This ioctl allows for periodically discarding > > journal contents too. > > This needs a documentation update to cover what this new userspace ABI > does, and probably linux-api and linux-fsdevel should be cc'd. You need to describe the semantics that you are exporting to userspace. Exactly what does a "journal checkpoint" mean from the point of view of user visible metadata and data integrity? How is it different to running fsync() on a directory, syncfs() or freeze on a filesystem, or any of the other methods we already have for guaranteeing completed changes are committed to stable storage? All of these methods imply a journal checkpoint of some kind is done in ext4, so why do we need a specific ioctl to do this? But, really, if this is for secure deletion, then why isn't "fsync(dirfd)" sufficient to force the unlinks into the journal and onto stable storage? Why does this functionality need some special new CAP_SYS_ADMIN only ioctl to checkpoint the journal when, by definition, fsync() should already be doing that? Indeed, users can't actually use this new ioctl for secure deletion because it is root only, so how do users and their applications actually ensure that secure deletion of their files has actually occurred? I'm also curious as to what "journal checkpoint" means for filesystems that don't have journals? Or that have multiple and/or partial state journals (e.g. iper-inode journals in NOVA, fsync journals in brtfs, etc)? What does it mean for filesystems that use COW instead of journals? > > Also, add journal discard (if discard supported) during journal load > > after recovery. This provides a potential solution to > > https://lore.kernel.org/linux-ext4/YDZoaacIYStFQT8g@mit.edu/ for > > disks that support discard. After a successful journal recovery, e2fsck > > can call this ioctl to discard the journal as well. If you need discard after recovery for secure remove, then you also really need discard on every extent being freed, too. Which then implies that the -o discard mount option needs to be used in conjunction with this functionality. Perhaps, then, journal discard at mount should be tied in to the -o discard mount option, and then the need for an ioctl to trigger this goes away completely. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 2/2] ext4: add ioctl FS_IOC_CHKPT_JRNL 2021-04-07 21:15 ` Dave Chinner @ 2021-04-08 1:26 ` Darrick J. Wong 2021-04-08 4:43 ` Dave Chinner 0 siblings, 1 reply; 16+ messages in thread From: Darrick J. Wong @ 2021-04-08 1:26 UTC (permalink / raw) To: Dave Chinner; +Cc: Leah Rumancik, linux-ext4 On Thu, Apr 08, 2021 at 07:15:00AM +1000, Dave Chinner wrote: > On Wed, Apr 07, 2021 at 11:35:47AM -0700, Darrick J. Wong wrote: > > On Wed, Apr 07, 2021 at 03:42:02PM +0000, Leah Rumancik wrote: > > > ioctl FS_IOC_CHKPT_JRNL checkpoints and flushes the journal. With the > > > CHKPT_JRNL_DISCARD flag set, the journal blocks are also discarded. > > > With the filename wipeout patch, Ext4 guarantees that all data will be > > > discarded on deletion. This ioctl allows for periodically discarding > > > journal contents too. > > > > This needs a documentation update to cover what this new userspace ABI > > does, and probably linux-api and linux-fsdevel should be cc'd. > > You need to describe the semantics that you are exporting to > userspace. Exactly what does a "journal checkpoint" mean from the > point of view of user visible metadata and data integrity? To be clear, my interests are /not/ the same as Leah's here -- I want to use this "checkpoint" call to provide a way for grub to know that it will be able to find boot files without needing to recover the log. For the grub use case, the user-visible behaviors that are required are: 1. All dirty file data in memory are flushed; 2. All committed metadata updates are forced to the ondisk log; 3. All log contents have been written into the filesystem. (Note confusing terminology here: in my head, #2 means checkpointing the ondisk log, whereas #3 means checkpointing the filesystem itself; and "FS_IOC_CHECKPOINT" means checkpointing the whole filesystem, not just the log.) ((For Leah's use case, I think you'd add "4. If the DISCARD flag is set, the journal storage will be zeroed." or something like that...)) > How is it different to running fsync() on a directory, syncfs() or > freeze on a filesystem, or any of the other methods we already have > for guaranteeing completed changes are committed to stable storage? This differs from fsync and syncfs in that it's sufficient to checkpoint the log (#2), whereas this new call would require also checkpointing the filesystem (#3). It's different from FIFREEZE in that it's not necessary to freeze ongoing modifications from running programs. The guarantees only apply to changes made before the call. A second difference is that if we made grub initiate a FIFREEZE/FITHAW cycle, the FITHAW call will unfreeze the filesystem even if another racing program had frozen the fs after grub wrote its files. > All of these methods imply a journal checkpoint of some kind is done > in ext4, so why do we need a specific ioctl to do this? For XFS, we don't have any kind of system call that will checkpoint the fs without the unwanted behaviors of FIFREEZE and FITHAW. AFAICT there's no explicit way to force a fs checkpoint in ext4 aside from contorted insanity with data=journal files and bmap(). Weird things like NOVA would have to figure out a way to checkpoint the whole fs (flush all the journals?). btrfs can probably get away with flushing the disk cache since it has COW btrees for metadata (fsync log notwithstanding); and I'd imagine stupid things like FAT would just return EOPNOTSUPP. To solve my stupid grub problem, this could easily be: ret = syncfs2(fd, SYNCFS_CHECKPOINT_FS); Though you'd have to add that new syncfs2 syscall. Probably a good way to get yourself invited to LSFMM. ;) Anyway, I'll let Leah lead the secure deletion aspects of this discussion. --D > But, really, if this is for secure deletion, then why isn't > "fsync(dirfd)" sufficient to force the unlinks into the journal and > onto stable storage? Why does this functionality need some special > new CAP_SYS_ADMIN only ioctl to checkpoint the journal when, by > definition, fsync() should already be doing that? Indeed, users can't > actually use this new ioctl for secure deletion because it is root > only, so how do users and their applications actually ensure that > secure deletion of their files has actually occurred? > > I'm also curious as to what "journal checkpoint" means for > filesystems that don't have journals? Or that have multiple and/or > partial state journals (e.g. iper-inode journals in NOVA, fsync > journals in brtfs, etc)? What does it mean for filesystems that use > COW instead of journals? > > > > Also, add journal discard (if discard supported) during journal load > > > after recovery. This provides a potential solution to > > > https://lore.kernel.org/linux-ext4/YDZoaacIYStFQT8g@mit.edu/ for > > > disks that support discard. After a successful journal recovery, e2fsck > > > can call this ioctl to discard the journal as well. > > If you need discard after recovery for secure remove, then you also > really need discard on every extent being freed, too. Which then > implies that the -o discard mount option needs to be used in > conjunction with this functionality. Perhaps, then, journal discard > at mount should be tied in to the -o discard mount option, and then > the need for an ioctl to trigger this goes away completely. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 2/2] ext4: add ioctl FS_IOC_CHKPT_JRNL 2021-04-08 1:26 ` Darrick J. Wong @ 2021-04-08 4:43 ` Dave Chinner 2021-04-08 23:49 ` Darrick J. Wong 0 siblings, 1 reply; 16+ messages in thread From: Dave Chinner @ 2021-04-08 4:43 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Leah Rumancik, linux-ext4 On Wed, Apr 07, 2021 at 06:26:51PM -0700, Darrick J. Wong wrote: > On Thu, Apr 08, 2021 at 07:15:00AM +1000, Dave Chinner wrote: > > On Wed, Apr 07, 2021 at 11:35:47AM -0700, Darrick J. Wong wrote: > > > On Wed, Apr 07, 2021 at 03:42:02PM +0000, Leah Rumancik wrote: > > > > ioctl FS_IOC_CHKPT_JRNL checkpoints and flushes the journal. With the > > > > CHKPT_JRNL_DISCARD flag set, the journal blocks are also discarded. > > > > With the filename wipeout patch, Ext4 guarantees that all data will be > > > > discarded on deletion. This ioctl allows for periodically discarding > > > > journal contents too. > > > > > > This needs a documentation update to cover what this new userspace ABI > > > does, and probably linux-api and linux-fsdevel should be cc'd. > > > > You need to describe the semantics that you are exporting to > > userspace. Exactly what does a "journal checkpoint" mean from the > > point of view of user visible metadata and data integrity? > > To be clear, my interests are /not/ the same as Leah's here -- I want to > use this "checkpoint" call to provide a way for grub to know that it > will be able to find boot files without needing to recover the log. > > For the grub use case, the user-visible behaviors that are required are: > > 1. All dirty file data in memory are flushed; > 2. All committed metadata updates are forced to the ondisk log; > 3. All log contents have been written into the filesystem. > > (Note confusing terminology here: in my head, #2 means checkpointing the > ondisk log, whereas #3 means checkpointing the filesystem itself; and > "FS_IOC_CHECKPOINT" means checkpointing the whole filesystem, not just > the log.) So, yeah, you just renamed the ioctl because you are clearly not just talking about a journal checkpoint. A journal checkpoint is what XFS does when it pushes the CIL to disk (i.e. #2). Quiescing the log is what we call #3 - basically bringing it to an empty, idle state. Which makes me even more concerned about defining the behaviour and semantics needed before we even talk about the API that would be used. > > All of these methods imply a journal checkpoint of some kind is done > > in ext4, so why do we need a specific ioctl to do this? > > For XFS, we don't have any kind of system call that will checkpoint the > fs without the unwanted behaviors of FIFREEZE and FITHAW. AFAICT > there's no explicit way to force a fs checkpoint in ext4 aside from > contorted insanity with data=journal files and bmap(). Weird things > like NOVA would have to figure out a way to checkpoint the whole fs > (flush all the journals?). So, yeah, you're not talking about a journal checkpoint. You're describing a completely different set of requirements.... :/ > btrfs can probably get away with flushing the disk cache since it has > COW btrees for metadata (fsync log notwithstanding); and I'd imagine > stupid things like FAT would just return EOPNOTSUPP. > > To solve my stupid grub problem, this could easily be: > > ret = syncfs2(fd, SYNCFS_CHECKPOINT_FS); Sure, but is that the same thing that Leah needs? Checkpoints don't imply discards (let alone journal discards) in any way, and adding (optional) discard support for random parts of filesysetms to syncfs() semantics doesn't seem like a very good fit... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 2/2] ext4: add ioctl FS_IOC_CHKPT_JRNL 2021-04-08 4:43 ` Dave Chinner @ 2021-04-08 23:49 ` Darrick J. Wong 2021-04-11 23:13 ` Dave Chinner 0 siblings, 1 reply; 16+ messages in thread From: Darrick J. Wong @ 2021-04-08 23:49 UTC (permalink / raw) To: Dave Chinner; +Cc: Leah Rumancik, linux-ext4 On Thu, Apr 08, 2021 at 02:43:27PM +1000, Dave Chinner wrote: > On Wed, Apr 07, 2021 at 06:26:51PM -0700, Darrick J. Wong wrote: > > On Thu, Apr 08, 2021 at 07:15:00AM +1000, Dave Chinner wrote: > > > On Wed, Apr 07, 2021 at 11:35:47AM -0700, Darrick J. Wong wrote: > > > > On Wed, Apr 07, 2021 at 03:42:02PM +0000, Leah Rumancik wrote: > > > > > ioctl FS_IOC_CHKPT_JRNL checkpoints and flushes the journal. With the > > > > > CHKPT_JRNL_DISCARD flag set, the journal blocks are also discarded. > > > > > With the filename wipeout patch, Ext4 guarantees that all data will be > > > > > discarded on deletion. This ioctl allows for periodically discarding > > > > > journal contents too. > > > > > > > > This needs a documentation update to cover what this new userspace ABI > > > > does, and probably linux-api and linux-fsdevel should be cc'd. > > > > > > You need to describe the semantics that you are exporting to > > > userspace. Exactly what does a "journal checkpoint" mean from the > > > point of view of user visible metadata and data integrity? > > > > To be clear, my interests are /not/ the same as Leah's here -- I want to > > use this "checkpoint" call to provide a way for grub to know that it > > will be able to find boot files without needing to recover the log. > > > > For the grub use case, the user-visible behaviors that are required are: > > > > 1. All dirty file data in memory are flushed; > > 2. All committed metadata updates are forced to the ondisk log; > > 3. All log contents have been written into the filesystem. > > > > (Note confusing terminology here: in my head, #2 means checkpointing the > > ondisk log, whereas #3 means checkpointing the filesystem itself; and > > "FS_IOC_CHECKPOINT" means checkpointing the whole filesystem, not just > > the log.) > > So, yeah, you just renamed the ioctl because you are clearly not just > talking about a journal checkpoint. A journal checkpoint is what > XFS does when it pushes the CIL to disk (i.e. #2). Quiescing the log > is what we call #3 - basically bringing it to an empty, idle state. > > Which makes me even more concerned about defining the behaviour and > semantics needed before we even talk about the API that would be > used. Ok, let's draft a manpage. Here's my mockup of a manpage for the ioctl, though again, I don't have a strong preference between this and a syncfs2 call. NAME ioctl_fs_ioc_checkpoint - Commit all filesystem changes to disk SYNOPSYS int ioctl(int fd, FS_IOC_CHECKPOINT, __u64 *flags); DESCRIPTION Ensure that all previous changes to the filesystem backing the given file descriptor are persisted to disk in the same form that they would be if the filesystem were to be unmounted cleanly. Changes made during or after this call are not required to be persisted. The motivation of this ioctl are twofold -- first, to provide a way for application software to prepare a mounted filesystem for future read-only access by primordial external applications (e.g. bootloaders) that do not know about crash recovery. The second motivation is to provide a way to clean out ephemeral areas of the filesystem involved in crash recovery for security cleaning purposes. FLAGS The flags argument should point to a __u64 containing any combination of the following flags: FS_IOC_CHECKPOINT_DISCARD_STAGING Issue a discard command to erase all areas of the filesystem that are involved in staging and committing updates. ERRORS Error codes can be one of, but are not limited to, the following: EFSCORRUPTED Filesystem corruption was detected. EINVAL One of the arguments is not valid. ENOMEM Not enough memory. ENOSPC Not enough space. <etc> > > > > All of these methods imply a journal checkpoint of some kind is done > > > in ext4, so why do we need a specific ioctl to do this? > > > > For XFS, we don't have any kind of system call that will checkpoint the > > fs without the unwanted behaviors of FIFREEZE and FITHAW. AFAICT > > there's no explicit way to force a fs checkpoint in ext4 aside from > > contorted insanity with data=journal files and bmap(). Weird things > > like NOVA would have to figure out a way to checkpoint the whole fs > > (flush all the journals?). > > So, yeah, you're not talking about a journal checkpoint. You're > describing a completely different set of requirements.... :/ Yes, I'm talking about making sure that we've written changes back into the whole fs, not just the journal. > > btrfs can probably get away with flushing the disk cache since it has > > COW btrees for metadata (fsync log notwithstanding); and I'd imagine > > stupid things like FAT would just return EOPNOTSUPP. > > > > To solve my stupid grub problem, this could easily be: > > > > ret = syncfs2(fd, SYNCFS_CHECKPOINT_FS); > > Sure, but is that the same thing that Leah needs? Checkpoints don't > imply discards (let alone journal discards) in any way, and adding > (optional) discard support for random parts of filesysetms to > syncfs() semantics doesn't seem like a very good fit... Yes. I think Leah and Ted are more inclined to go with an ioctl since this is something that's peculiar to journalled filesystems. --D > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 2/2] ext4: add ioctl FS_IOC_CHKPT_JRNL 2021-04-08 23:49 ` Darrick J. Wong @ 2021-04-11 23:13 ` Dave Chinner 0 siblings, 0 replies; 16+ messages in thread From: Dave Chinner @ 2021-04-11 23:13 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Leah Rumancik, linux-ext4 On Thu, Apr 08, 2021 at 04:49:09PM -0700, Darrick J. Wong wrote: > On Thu, Apr 08, 2021 at 02:43:27PM +1000, Dave Chinner wrote: > > On Wed, Apr 07, 2021 at 06:26:51PM -0700, Darrick J. Wong wrote: > > > On Thu, Apr 08, 2021 at 07:15:00AM +1000, Dave Chinner wrote: > > > > On Wed, Apr 07, 2021 at 11:35:47AM -0700, Darrick J. Wong wrote: > > > > > On Wed, Apr 07, 2021 at 03:42:02PM +0000, Leah Rumancik wrote: > > > > > > ioctl FS_IOC_CHKPT_JRNL checkpoints and flushes the journal. With the > > > > > > CHKPT_JRNL_DISCARD flag set, the journal blocks are also discarded. > > > > > > With the filename wipeout patch, Ext4 guarantees that all data will be > > > > > > discarded on deletion. This ioctl allows for periodically discarding > > > > > > journal contents too. > > > > > > > > > > This needs a documentation update to cover what this new userspace ABI > > > > > does, and probably linux-api and linux-fsdevel should be cc'd. > > > > > > > > You need to describe the semantics that you are exporting to > > > > userspace. Exactly what does a "journal checkpoint" mean from the > > > > point of view of user visible metadata and data integrity? > > > > > > To be clear, my interests are /not/ the same as Leah's here -- I want to > > > use this "checkpoint" call to provide a way for grub to know that it > > > will be able to find boot files without needing to recover the log. > > > > > > For the grub use case, the user-visible behaviors that are required are: > > > > > > 1. All dirty file data in memory are flushed; > > > 2. All committed metadata updates are forced to the ondisk log; > > > 3. All log contents have been written into the filesystem. > > > > > > (Note confusing terminology here: in my head, #2 means checkpointing the > > > ondisk log, whereas #3 means checkpointing the filesystem itself; and > > > "FS_IOC_CHECKPOINT" means checkpointing the whole filesystem, not just > > > the log.) > > > > So, yeah, you just renamed the ioctl because you are clearly not just > > talking about a journal checkpoint. A journal checkpoint is what > > XFS does when it pushes the CIL to disk (i.e. #2). Quiescing the log > > is what we call #3 - basically bringing it to an empty, idle state. > > > > Which makes me even more concerned about defining the behaviour and > > semantics needed before we even talk about the API that would be > > used. > > Ok, let's draft a manpage. Here's my mockup of a manpage for the ioctl, > though again, I don't have a strong preference between this and a > syncfs2 call. I'd much prefer a syscall as we already have sync() and syncfs() syscalls to do very similar things. > NAME > > ioctl_fs_ioc_checkpoint - Commit all filesystem changes to disk > > SYNOPSYS > > int ioctl(int fd, FS_IOC_CHECKPOINT, __u64 *flags); > > DESCRIPTION > > Ensure that all previous changes to the filesystem backing the given > file descriptor are persisted to disk in the same form that they would > be if the filesystem were to be unmounted cleanly. Changes made during > or after this call are not required to be persisted. What does "unmounted cleanly" actually mean? An application writer has no idea what this might mean for their application. Also, what happens with a filesystem that "cleanly unmounts" but still requires work to be done on/after the next mount? e.g. per-inode journals might only get replayed when the inode is next referenced, not when the filesystem mounts. IMO, if this is going to force a fully consistent on-disk structure, it needs to be described as providing similar guarantees as a fsfreeze(8), but only for modifications completed before the checkpoint is issued. i.e. a read-only mount on a read-only block device will see all the changes completed before the checkpoint was taken. > The motivation of this ioctl are twofold -- first, to provide a way for > application software to prepare a mounted filesystem for future > read-only access by primordial external applications (e.g. bootloaders) > that do not know about crash recovery. Yup, moving on-disk state to an external read-only access compatible state is the requirement, not "cleanly unmounted". /me wonders how we plan to prevent modifications to files and metadata writeback during the checkpointing process from resulting in inconsistent read-only state on disk. e.g. concurrent directory operations that are partialy written back while the checkpoint is doing it's magic writing back active metadata in the journal might result in directories being in inconsistent state on disk, even though all the modifications that happened prior to the checkpoint were captured. It's problems like these that make me ask how the "read-only access" guarantee is going to work if we haven't actually frozen the filesystem to prevent concurrent modifications leaking into the on-disk state and making it inconsistent.... > The second motivation is to > provide a way to clean out ephemeral areas of the filesystem involved in > crash recovery for security cleaning purposes. > > FLAGS > > The flags argument should point to a __u64 containing any combination of > the following flags: > > FS_IOC_CHECKPOINT_DISCARD_STAGING > Issue a discard command to erase all areas of the filesystem > that are involved in staging and committing updates. I don't think this should be conflated with a filesystem checkpoint. It is, fundamentally, a completely different operation, and one that might be exceedingly complex and slow to implement. e.g. does this definition imply deleting and discarding all the previous versions of now unreferenced metadata in a COW filesytem? iWhat does it imply for log structured filesystems? What about filesystems that can discard journal/staging areas without checkpointing? Sure, discarding the journal might require a full journal checkpoint for filesystems like ext4, but I can see that we don't actually need to implement "read-only" guarantees for XFS to implement a journal discard. For XFS, we'd just need to force the log, push AIL to the required target LSN, force the log again to update the on-disk tail, take the CIL checkpoint write lock to prevent the head moving forward, and now discard all the unused journal between the head and tail. IOWs, there is -no requirement- for the filesystem to be in a read-only access compatible state to discard the unused parts of the journal - we only need to hold off journal writes for a short period of time to do the discards.... Hence I think "journal discard" needs to be a separate syscall/ioctl, not get conflated with a filesystem checkpoint operation. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2021-04-11 23:38 UTC | newest] Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-04-07 15:42 [PATCH v2 0/2] Filename wipeout patch series updates Leah Rumancik 2021-04-07 15:42 ` [PATCH v2 1/2] ext4: wipe filename upon file deletion Leah Rumancik 2021-04-07 21:33 ` Eric Biggers 2021-04-08 3:48 ` Theodore Ts'o 2021-04-08 5:21 ` Dave Chinner 2021-04-08 19:25 ` Theodore Ts'o 2021-04-09 0:02 ` Darrick J. Wong 2021-04-09 2:51 ` Theodore Ts'o 2021-04-11 23:38 ` Dave Chinner 2021-04-07 15:42 ` [PATCH v2 2/2] ext4: add ioctl FS_IOC_CHKPT_JRNL Leah Rumancik 2021-04-07 18:35 ` Darrick J. Wong 2021-04-07 21:15 ` Dave Chinner 2021-04-08 1:26 ` Darrick J. Wong 2021-04-08 4:43 ` Dave Chinner 2021-04-08 23:49 ` Darrick J. Wong 2021-04-11 23:13 ` Dave Chinner
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).