* Directory unremovable on ext4 no_journal mode
@ 2018-04-10  0:08 Jayashree Mohan
  2018-04-10  0:38 ` Darrick J. Wong
  2018-04-10  3:12 ` Theodore Y. Ts'o
  0 siblings, 2 replies; 5+ messages in thread
From: Jayashree Mohan @ 2018-04-10  0:08 UTC
  To: linux-ext4, fstests; +Cc: Vijaychidambaram Velayudhan Pillai

Hi,

We stumbled upon what seems to be a bug that makes a directory
unremovable on ext4 when mounted with the no_journal option.

The sequence of operations described below led to the following state:
“A directory that was renamed was persisted in both its original parent
and the target directory, with the same inode number. This also means
the rename was not atomic on storage. In addition, the renamed directory
becomes unremovable in the target directory, with an FS error logged in
dmesg.”

Here are more details of the workload and the corresponding failure.

Workload:

mkdir /mnt/test/X and /mnt/test/Y
mkdir /mnt/test/X/Z
sync()
rename /mnt/test/X/Z to /mnt/test/Y/Z
fsync /mnt/test/Y
-- Crash now --
Remount
ls X and Y (Z is present in both directories X and Y, with the same
inode number)
rmdir /mnt/test/X/Z  (This succeeds)
rmdir /mnt/test/Y/Z  (This fails with an FS error logged in dmesg)
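
For anyone who wants to replay the pre-crash part of the workload, the
steps above can be sketched in Python roughly as follows. This is only
a sketch: the helper names (fsync_dir, run_workload) are made up for
illustration, a scratch directory stands in for /mnt/test, and the
crash itself has to be injected by a separate harness that is not
shown here.

```python
import os
import tempfile


def fsync_dir(path):
    """Open a directory and fsync it so that its entries are flushed."""
    fd = os.open(path, os.O_RDONLY | os.O_DIRECTORY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def run_workload(root):
    """Replay the workload under `root`, up to the point of the crash."""
    x = os.path.join(root, "X")
    y = os.path.join(root, "Y")
    os.mkdir(x)
    os.mkdir(y)
    os.mkdir(os.path.join(x, "Z"))
    os.sync()                                   # persist X, Y and X/Z
    os.rename(os.path.join(x, "Z"), os.path.join(y, "Z"))
    fsync_dir(y)                                # persist only directory Y
    # A crash would be injected here by the test harness (not shown).
    return x, y


if __name__ == "__main__":
    # Assumption: a temporary directory stands in for the ext4 mount point.
    with tempfile.TemporaryDirectory() as root:
        x, y = run_workload(root)
        print("X:", os.listdir(x), "Y:", os.listdir(y))
```

Without a crash, Z appears only under Y. The state we report below is
what we saw after crashing right after the fsync and remounting.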


Results:

rmdir: failed to remove '/mnt/test/Y/Z': Structure needs cleaning

The corresponding dmesg log has the following error message:
[66799.504124] EXT4-fs error (device cow_ram_snapshot1_0): ext4_lookup:1576: inode #12: comm rmdir: deleted inode referenced: 14
[66799.504131] EXT4-fs (cow_ram_snapshot1_0): Remounting filesystem read-only
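
Our reading of the log, which may well be incomplete, is that the first
rmdir frees inode 14 while the entry in Y still points at it, so the
second lookup finds an entry that references a deleted inode. The toy
model below illustrates that reading only; it is not ext4 code, and
ToyFS and post_crash_state are hypothetical names.

```python
import errno

# "Structure needs cleaning" is EUCLEAN on Linux; fall back to 117 elsewhere.
EUCLEAN = getattr(errno, "EUCLEAN", 117)


class ToyFS:
    """Toy model: directories map entry names to inode numbers."""

    def __init__(self):
        self.live_inodes = set()
        self.dirs = {}   # directory name -> {entry name: inode number}

    def rmdir(self, parent, name):
        ino = self.dirs[parent][name]
        if ino not in self.live_inodes:
            # The entry points at an inode that was already freed.
            raise OSError(EUCLEAN, "deleted inode referenced: %d" % ino)
        del self.dirs[parent][name]
        self.live_inodes.discard(ino)   # the inode is freed by this rmdir


def post_crash_state():
    """State seen after remount: Z (inode 14) is listed in both X and Y."""
    fs = ToyFS()
    fs.live_inodes = {14}
    fs.dirs = {"X": {"Z": 14}, "Y": {"Z": 14}}
    return fs
```

In this model, rmdir("X", "Z") succeeds and frees inode 14, and the
following rmdir("Y", "Z") raises OSError with errno EUCLEAN, which
mirrors the two rmdir results in the workload.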

The sequence of operations listed above makes dir Z unremovable
from dir Y, which seems like unexpected behavior. Could you provide
more details on the reason for this behavior? We understand that we
are running this in ext4's no_journal mode, but would like you to
confirm whether this behavior is acceptable.

Do let us know if we are missing any detail here.

Thanks,
Jayashree Mohan


* Re: Directory unremovable on ext4 no_journal mode
  2018-04-10  0:08 Directory unremovable on ext4 no_journal mode Jayashree Mohan
@ 2018-04-10  0:38 ` Darrick J. Wong
  2018-04-10  3:12 ` Theodore Y. Ts'o
  1 sibling, 0 replies; 5+ messages in thread
From: Darrick J. Wong @ 2018-04-10  0:38 UTC
  To: Jayashree Mohan; +Cc: linux-ext4, fstests, Vijaychidambaram Velayudhan Pillai

On Mon, Apr 09, 2018 at 07:08:13PM -0500, Jayashree Mohan wrote:
> Hi,
> 
> We stumbled upon what seems to be a bug that makes a “directory
> unremovable”,  on ext4 when mounted with no_journal option.
> 
> A sequence of operations described below led to the following state :
> “A directory that was renamed, was persisted in both parent and target
> directories, with the same inode number. This also means the rename
> was non-atomic on storage. In addition, the renamed directory becomes
> unremovable on the target with FS-error logged in dmesg.”
> 
> Here are more details of the workload and the corresponding failure.
> 
> Workload :
> 
> mkdir /mnt/test/X and /mnt/test/Y
> mkdir X/Z
> sync()
> rename X/Z   Y/Z
> fsync Y
> -- Crash now --
> Remount

You're supposed to run e2fsck after a crash to clean up the metadata.
nojournal disables the piece that takes care of that.

--D



* Re: Directory unremovable on ext4 no_journal mode
  2018-04-10  0:08 Directory unremovable on ext4 no_journal mode Jayashree Mohan
  2018-04-10  0:38 ` Darrick J. Wong
@ 2018-04-10  3:12 ` Theodore Y. Ts'o
  2018-04-10  3:21   ` Vijay Chidambaram
  1 sibling, 1 reply; 5+ messages in thread
From: Theodore Y. Ts'o @ 2018-04-10  3:12 UTC
  To: Jayashree Mohan; +Cc: linux-ext4, fstests, Vijaychidambaram Velayudhan Pillai

On Mon, Apr 09, 2018 at 07:08:13PM -0500, Jayashree Mohan wrote:
> Hi,
> 
> We stumbled upon what seems to be a bug that makes a “directory
> unremovable”,  on ext4 when mounted with no_journal option.

Hi Jayashree,

If you use no_journal mode, you **must** run e2fsck after a crash.
And you have to be ready for potential data loss after a crash.
So no, this isn't a bug.  The guarantees that you have when you use
no_journal are essentially limited to what POSIX specifies when you
crash uncleanly --- "the results are undefined".
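
As a rough sketch of what "run e2fsck after a crash" could look like
when scripted: the Python below runs a forced, non-interactive check
and decodes the exit status using the bit values listed in the
e2fsck(8) man page.  The function names are illustrative, the device
path is whatever the caller passes in, and the file system must be
unmounted first.

```python
import subprocess

# Exit-status bits as documented in the e2fsck(8) man page.
E2FSCK_BITS = {
    1: "file system errors corrected",
    2: "file system errors corrected, system should be rebooted",
    4: "file system errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "e2fsck canceled by user request",
    128: "shared library error",
}


def describe_status(status):
    """Translate an e2fsck exit status into a list of messages."""
    if status == 0:
        return ["no errors"]
    return [text for bit, text in E2FSCK_BITS.items() if status & bit]


def check_after_crash(device):
    """Run e2fsck on an unmounted device and describe the outcome."""
    # -f forces a check even if the file system looks clean;
    # -y answers "yes" to every repair question.
    result = subprocess.run(["e2fsck", "-f", "-y", device])
    return describe_status(result.returncode)
```

Whether answering "yes" to every question is acceptable depends on how
much data loss you can tolerate, which is the point of the paragraph
above.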

> The sequence of operations listed above is making dir Z unremovable
> from dir Y, which seems like unexpected behavior. Could you provide
> more details on the reason for such behavior? We understand we run
> this on no_journal mode of ext4, but would like you to verify if this
> behavior is acceptable.

We use no_journal mode at Google, but we are prepared to effectively
reinstall the root partition, and we are prepared to lose data on our
data disks, after a crash.  We are OK with this because all persistent
data stored on machines is either data we are prepared to lose (e.g.,
cached data or easily reinstalled system software) or part of our
cluster file system, where we use erasure codes to ensure that data in
the cluster file system can remain accessible even if (a) a disk dies
completely, or (b) the entry router on the rack dies, denying access
to all of the disks in a rack from the cluster file system until the
router can be repaired.  So losing a file or a directory after running
e2fsck after a crash is actually small beer compared to any number of
other things that can happen to a disk.
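
To illustrate the general idea of erasure coding (and only the general
idea; nothing here describes the codes actually used in the cluster
file system), the simplest possible scheme is a single XOR parity
block, which lets any one lost block be rebuilt from the survivors.
The function names below are made up for this sketch.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode(data_blocks):
    """Return the data blocks followed by one XOR parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def recover(stripe, lost_index):
    """Rebuild the block at lost_index from all of the surviving blocks."""
    survivors = [b for i, b in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)
```

If each block of a stripe lives on a different disk (or rack), losing
any single one of them still leaves the data recoverable, which is why
one damaged local file system matters so little in that setting.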

The goal for no_journal mode is performance at all costs, and we are
prepared to sacrifice file system robustness after a crash.  This
means we aren't doing any kind of FUA writes or CACHE FLUSH
operations, because those would compromise performance.  (As a thought
experiment, I would encourage you to try to design a file system
that would provide better guarantees without using FUA writes or CACHE
FLUSH operations, and with the HDD's write-back cache enabled.)

To understand why this is so important, I would recommend that you
read the "Disks for Data Centers" paper[1].  There is also a lot of
good stuff in the FAST 2016 keynote[2] that isn't in the paper or the
slides.  So listening to the audio recording is also something I
strongly commend for people who want to understand Google's approach
to storage.  (Before 2016, we had always considered this part of our
"secret sauce" that we had never disclosed for the past decade, since
it is what gave us a huge storage TCO advantage over other companies.)

[1] https://research.google.com/pubs/pub44830.html
[2] https://www.usenix.org/node/194391

Essentially, we are trying to use both of the baskets of value
provided by each HDD.  That is, we want to use nearly all of the byte
capacity and all of the IOPS that an HDD can provide --- and FUA
writes or CACHE FLUSHES significantly compromise the number of I/O
operations the HDD can provide.  (More details about how we do this at
the cluster level can be found in the PDSW 2017 keynote[3], but it
goes well beyond the scope of what gets done on a single file system
on a single HDD.)

[3] http://www.pdsw.org/pdsw-discs17/slides/PDSW-DISCS-Google-Keynote.pdf

Regards,

						- Ted

P.S.  This is not to say that the work you are doing with CrashMonkey
et al. is useless; it's just not applicable for a cluster file
system in a hyper-scale cloud environment.  Local disk file systems
and robustness after a crash are still important in applications such
as Android and Chrome OS, for example.  Note that we do *not* use
no_journal mode in those environments.  :-)


* Re: Directory unremovable on ext4 no_journal mode
  2018-04-10  3:12 ` Theodore Y. Ts'o
@ 2018-04-10  3:21   ` Vijay Chidambaram
  2018-04-10 12:07     ` Jayashree Mohan
  0 siblings, 1 reply; 5+ messages in thread
From: Vijay Chidambaram @ 2018-04-10  3:21 UTC
  To: Theodore Y. Ts'o; +Cc: Jayashree Mohan, Ext4, fstests

Thanks Ted! This information is very useful. We won't pursue testing
ext4-no-journal further, as there is no problem e2fsck cannot fix if
data loss is tolerated.

I wanted to point you to an old paper of mine that has a similar goal
of performance at all costs: the No Order File System
(http://research.cs.wisc.edu/adsl/Publications/nofs-fast12.pdf). It
doesn't use any FLUSH or FUA instructions, and instead obtains
consistency from mutual agreement between file-system objects. It
requires that we be able to atomically write a "backpointer" with each
disk block (perhaps in an out-of-band area). I thought you might find
it interesting!
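
To give a flavor of the backpointer idea, here is a heavily simplified
sketch. It is an illustration rather than the design in the paper, and
Block and valid_blocks are names made up for this example. A pointer
from an inode to a block is trusted only if the block's own backpointer
names that inode.

```python
from dataclasses import dataclass


@dataclass
class Block:
    data: bytes
    owner: int   # backpointer: inode number written atomically with the block


def valid_blocks(inode_number, pointers, disk):
    """Keep only the pointers whose target block agrees it belongs to this inode."""
    return [
        addr for addr in pointers
        if addr in disk and disk[addr].owner == inode_number
    ]
```

For example, if inode 7 points at blocks 100 and 101 but block 101
carries a backpointer to inode 9, only block 100 is treated as part of
inode 7, with no ordering of the writes required.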

Thanks,
Vijay Chidambaram
http://www.cs.utexas.edu/~vijay/



* Re: Directory unremovable on ext4 no_journal mode
  2018-04-10  3:21   ` Vijay Chidambaram
@ 2018-04-10 12:07     ` Jayashree Mohan
  0 siblings, 0 replies; 5+ messages in thread
From: Jayashree Mohan @ 2018-04-10 12:07 UTC
  To: Vijaychidambaram Velayudhan Pillai; +Cc: Theodore Y. Ts'o, Ext4, fstests

Hi Ted,
Thank you for the detailed response! It makes things much clearer now.
I understand why no_journal mode is used and what guarantees to expect
while using it. I will keep this in mind for future CrashMonkey testing.

Thanks,
Jayashree
