* Question about ext4 journal
@ 2015-10-23  4:04 Masanari Iida
       [not found] ` <CAOQ4uxhFw5HCyVuYsKROKvMHJWBCEqZfBM8gtcV1bPQGFXT-hQ@mail.gmail.com>
  2015-10-23 12:50 ` Theodore Ts'o
  0 siblings, 2 replies; 5+ messages in thread
From: Masanari Iida @ 2015-10-23  4:04 UTC
  To: linux-ext4

Hello Developer,
I have a question about ext4's internals.

OS: RHEL6.2
Filesystem EXT4
mount option = ordered

My understanding of ext4 in ordered mode is this:
when a file is created, data is written to the FS block.
At the same time, metadata is stored in the journal,
and then the metadata in the journal is written to the inode block.
What happens next?

My question is:
does the kernel remove the metadata from the journal after each
successful transaction?

When I look at the contents of the journal in EXT4 using debugfs(8),
the journal entries grow as files are created or deleted.
I am curious to know what makes the system remove journal entries
while the fs is mounted.

Background of the question:
I encountered a case where, after I deleted and created some files,
a journal entry for deleting the file existed,
but a journal entry for creating the file did not.
FYI, the file itself exists when I look at it using debugfs.

I created a snapshot of the filesystem and ran fsck on the copied image.
The file was then _removed_ by the fsck operation.
This is why I want to know how the journal on EXT4 is managed.

Thanks
Masanari


* Fwd: Question about ext4 journal
       [not found] ` <CAOQ4uxhFw5HCyVuYsKROKvMHJWBCEqZfBM8gtcV1bPQGFXT-hQ@mail.gmail.com>
@ 2015-10-23  6:22   ` Amir Goldstein
  2015-10-24  3:10     ` Masanari Iida
  0 siblings, 1 reply; 5+ messages in thread
From: Amir Goldstein @ 2015-10-23  6:22 UTC
  To: Masanari Iida; +Cc: Ext4

On Fri, Oct 23, 2015 at 7:04 AM, Masanari Iida <standby24x7@gmail.com> wrote:
>
> Hello Developer,


Hi Masanari,

>
> I have a question about ext4's internal.
>
> OS: RHEL6.2
> Filesystem EXT4
> mount option = ordered
>
> My understanding on ext4 with ordered mode,
> When a file is created,  data is written to FS block,
> At the same time,  metadata is stored into journal,
> and then meta data on journal is written to the inode block.
> What is the next?

there is also the metadata of the directory entry created for the new file
that gets journaled as well
>
>
> My question is
> Does the kernel remove the meta data on journal after each successful
>  transaction?

no. journal only logs transactions
>
>
> As I see the contents of journal entries in EXT4 using debugfs(8),
> the journal entries are growing when creating or deleting the files.
> I am curious to know what make the system to remove journal entries
> while mounted the fs.

it's called the orphan inode list.
every deleted inode gets inserted into the list, then deleted, then
removed from orphan inodes list.
on mount, fs starts by "playing" the journal, then clearing the orphan
inodes list

>
> Background of the question.
> I have encountered a case that when I delete and create some files,
> journal entry for deleting the file exist
> But journal entry for creating the file was not exist.
> FYI, the file itself exist when I see it by using debugfs.
>
> I created snapshot of the filesystem and  run fsck on copy image.
> Then the file was _removed_ by fsck operation.

fsck starts by "playing" the journal. then I think it will ask about
clearing the orphan inode list. can't remember.

>
> This is why I want to know how journal on EXT4 were controlled.
>
> Thanks
> Masanari


* Re: Question about ext4 journal
  2015-10-23  4:04 Question about ext4 journal Masanari Iida
       [not found] ` <CAOQ4uxhFw5HCyVuYsKROKvMHJWBCEqZfBM8gtcV1bPQGFXT-hQ@mail.gmail.com>
@ 2015-10-23 12:50 ` Theodore Ts'o
  1 sibling, 0 replies; 5+ messages in thread
From: Theodore Ts'o @ 2015-10-23 12:50 UTC
  To: Masanari Iida; +Cc: linux-ext4

On Fri, Oct 23, 2015 at 01:04:54PM +0900, Masanari Iida wrote:
> Hello Developer,
> I have a question about ext4's internal.
> 
> OS: RHEL6.2
> Filesystem EXT4
> mount option = ordered
> 
> My understanding on ext4 with ordered mode,
> When a file is created,  data is written to FS block,
> At the same time,  metadata is stored into journal,
> and then meta data on journal is written to the inode block.
> What is the next?

Well, that's not quite a complete picture.  Ext4 has an advanced
feature called delayed allocation, which means that the FS block is
not allocated until writeback occurs.

So there is some file system metadata which is modified as soon as the
file is created (i.e., the directory, inode allocation bitmap, the
inode table block itself), and this is held in memory until the
journal commit is triggered (either every five seconds, or if the size
of the transaction grows beyond a certain size, or an fsync), at which
point the metadata blocks that have been modified since the last
commit are written into the journal, and once the commit block is
written, the modified metadata blocks are _allowed_ to be written back
to disk by the normal writeback mechanisms.

When the data writeback timer expires (30 seconds by default), then
writeback happens.  It's only then that the location on disk is
determined, and when the block is allocated, this will result in more
metadata blocks getting modified, which are handled as described
above.  In general once we've allocated the block, the write to disk
is immediately scheduled, and the commit that commits the resulting
metadata changes will happen shortly after.
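
As a rough illustration of the two timers described above, here is a
minimal Python sketch. It is a toy model, not jbd2 or ext4 code. The
function names and the max_txn_blocks threshold are made up for
illustration; the 5 second commit interval is the value mentioned above,
and /proc/sys/vm/dirty_expire_centisecs is assumed to be the tunable
behind the 30 second data writeback timer.

```python
def centisecs_to_seconds(raw: str) -> float:
    """Convert the text of a /proc/sys/vm/*_centisecs file to seconds."""
    return int(raw.strip()) / 100.0


def read_dirty_expire_seconds(path="/proc/sys/vm/dirty_expire_centisecs"):
    """Return how old dirty data must be before writeback picks it up.

    Assumption: this tunable corresponds to the "30 seconds by default"
    writeback timer mentioned in the message. Linux only.
    """
    with open(path) as f:
        return centisecs_to_seconds(f.read())


def commit_due(seconds_since_commit, txn_blocks, fsync_requested,
               commit_interval=5.0, max_txn_blocks=8192):
    """Toy predicate for the three commit triggers named above.

    max_txn_blocks is a hypothetical size limit, not a real jbd2 constant.
    """
    return (fsync_requested
            or seconds_since_commit >= commit_interval
            or txn_blocks >= max_txn_blocks)
```

For example, commit_due(1.0, 10, False) is False (nothing has triggered
yet), while commit_due(1.0, 10, True) is True because an fsync forces
the commit.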

> My question is
> Does the kernel remove the meta data on journal after each successful
>  transaction?

The journal is a circular buffer.  Once all of the blocks that
participated in a jbd2 transaction have been written back to their
final location on disk, the transaction gets retired.  However, we
don't necessarily automatically update the jbd2 superblock's tail
pointer each time a transaction can be retired, because doing this to
"remove" one or more transactions requires a write to the jbd2
superblock, and we want to minimize unnecessary writes.  This might
mean that when we recover after a crash, we might end up replaying
some transactions that don't need to be replayed, but that should be
an uncommon case that we shouldn't be optimizing for.
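
The lazy tail update can be sketched with a toy model in Python. This is
only an illustration under simplifying assumptions, not how jbd2 is
implemented: the log is counted in whole transactions rather than
blocks, and the class and method names (ToyJournal, checkpoint,
update_superblock, replay_on_recovery) are hypothetical.

```python
class ToyJournal:
    """Toy circular log whose on-disk tail pointer is updated lazily."""

    def __init__(self, capacity):
        self.capacity = capacity  # how many transactions fit in the log
        self.head = 1             # tid the next commit will get
        self.tail = 1             # in memory: oldest tid not yet written back
        self.sb_tail = 1          # tail as recorded in the toy superblock
        self.sb_writes = 0        # number of superblock writes so far

    def commit(self):
        """Append one transaction, reclaiming log space only when needed."""
        if self.head - self.sb_tail >= self.capacity:
            if self.tail == self.sb_tail:
                raise RuntimeError("log full: nothing can be retired yet")
            self.update_superblock()
        tid = self.head
        self.head += 1
        return tid

    def checkpoint(self, upto_tid):
        """Record that transactions up to upto_tid reached their final location."""
        self.tail = max(self.tail, min(upto_tid + 1, self.head))

    def update_superblock(self):
        """Persist the tail pointer; this is the write we try to avoid."""
        if self.sb_tail != self.tail:
            self.sb_tail = self.tail
            self.sb_writes += 1

    def replay_on_recovery(self):
        """Transactions a crash recovery would replay, per the on-disk tail."""
        return list(range(self.sb_tail, self.head))
```

In this model, after three commits and checkpoint(2), a crash would
still replay transactions 1 to 3, although 1 and 2 no longer need it.
The superblock is only written once the log runs out of room.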

> As I see the contents of journal entries in EXT4 using debugfs(8),
> the journal entries are growing when creating or deleting the files.
> I am curious to know what make the system to remove journal entries
> while mounted the fs.
> 
> Background of the question.
> I have encountered a case that when I delete and create some files,
> journal entry for deleting the file exist
> But journal entry for creating the file was not exist.
> FYI, the file itself exist when I see it by using debugfs.
> 
> I created snapshot of the filesystem and  run fsck on copy image.
> Then the file was _removed_ by fsck operation.
> This is why I want to know how journal on EXT4 were controlled.

This is too vague for me to comment on.  If you give a very detailed
description of what file system operations you were trying to do,
whether you called fsync(2) or not, and how long you waited before
taking the snapshot, that would be helpful.

I will observe that because of delayed allocation, if you take a
snapshot or there is a crash immediately after writing the file,
without waiting for the writeback timer to expire, what you might find
after the recovery process is a zero-length file.  If you want to make
sure a file and its contents will be there after a crash, make sure you
call the fsync() system call.
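
A minimal sketch of that advice in Python, assuming a POSIX system. The
helper name write_durably is made up for illustration. The extra fsync
on the parent directory goes beyond what is stated above; it is a common
precaution so that the new directory entry is also on disk.

```python
import os


def write_durably(path: str, data: bytes) -> None:
    """Write data to path and ask the kernel to push it to stable storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # Forces allocation and writeback of the data and the inode.
        os.fsync(fd)
    finally:
        os.close(fd)

    # Assumption beyond the message above: also sync the directory so the
    # name itself survives a crash.
    dirfd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dirfd)
    finally:
        os.close(dirfd)
```

Without the fsync calls, the data may stay only in the page cache until
the writeback timer fires, which is the window in which a crash or
snapshot can leave a zero-length file.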

						- Ted
						


* Re: Question about ext4 journal
  2015-10-23  6:22   ` Fwd: " Amir Goldstein
@ 2015-10-24  3:10     ` Masanari Iida
  2015-10-24 11:23       ` Theodore Ts'o
  0 siblings, 1 reply; 5+ messages in thread
From: Masanari Iida @ 2015-10-24  3:10 UTC
  To: Amir Goldstein; +Cc: Ext4

On Fri, Oct 23, 2015 at 3:22 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> On Fri, Oct 23, 2015 at 7:04 AM, Masanari Iida <standby24x7@gmail.com> wrote:
>>
>> As I see the contents of journal entries in EXT4 using debugfs(8),
>> the journal entries are growing when creating or deleting the files.
>> I am curious to know what make the system to remove journal entries
>> while mounted the fs.
>
> it's called the orphan inode list.
> every deleted inode gets inserted into the list, then deleted, then
> removed from orphan inodes list.
> on mount, fs starts by "playing" the journal, then clearing the orphan
> inodes list
>
Hello Amir,   Ted,

My test case
(1)   Create a small file by vi.  (about 1000bytes)
(2)   Delete the file "rm -f"
===== within 1 min =====
(3)   Create a small file again with same file name as (1).
=====  12 hours of no operation ,  no umount, no mount, no fsck ======
(4)  Snapshot ( created by storage device )
===== some days ======
(5)  Run fsck on the fs image and mount the fs

I understand that in step 2, when the file is deleted:
   the inode gets inserted into the orphan list,
         the file data is deleted,
             the inode gets removed from the orphan list.

My wild guess is
file was created in inode #100 (for example),
it was deleted and inode #100 moved to orphan list.
The inode #100 was re-used in step 3.
snapshot taken.
Run fsck.   inode #100 was found in orphan list, so the file with same
inode was deleted during fsck.

Did we have such trouble during 2.6.32 time?

Regards,
Masanari


* Re: Question about ext4 journal
  2015-10-24  3:10     ` Masanari Iida
@ 2015-10-24 11:23       ` Theodore Ts'o
  0 siblings, 0 replies; 5+ messages in thread
From: Theodore Ts'o @ 2015-10-24 11:23 UTC
  To: Masanari Iida; +Cc: Amir Goldstein, Ext4

On Sat, Oct 24, 2015 at 12:10:48PM +0900, Masanari Iida wrote:
> 
> My wild guess is
> file was created in inode #100 (for example),
> it was deleted and inode #100 moved to orphan list.
> The inode #100 was re-used in step 3.

The inode is removed from the orphan list as soon as the inode is
released.  The only time an inode remains on the orphan list is if
some other process had an open file descriptor on the file before it
was unlinked.  But in that case, when you recreate the file with the
same name, you will get a new inode, because the old inode has not yet
been released.  Once it has been released, it is removed from the orphan list.
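
The "new inode" behaviour can be observed from user space. The following
Python sketch assumes a POSIX file system; the function name
recreate_while_open is made up for illustration. It shows only the inode
numbers seen by stat(2), not the on-disk orphan list itself.

```python
import os
import tempfile


def recreate_while_open():
    """Unlink a file that is still open, then recreate the same name.

    Returns (old_inode, old_link_count, new_inode).
    """
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "testfile")
        with open(path, "w") as f:
            f.write("first\n")

        # Stands in for "some other process" holding the file open.
        held = os.open(path, os.O_RDONLY)
        try:
            os.unlink(path)
            old = os.fstat(held)  # inode is still allocated, name is gone

            with open(path, "w") as f:
                f.write("second\n")
            new_ino = os.stat(path).st_ino

            return old.st_ino, old.st_nlink, new_ino
        finally:
            os.close(held)  # only now can the old inode be released
```

Because the old inode is still in use while the descriptor is open, the
recreated file cannot get the same inode number, so the two numbers
returned differ.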

> snapshot taken.

If the storage device is not synchronized with the file system then
there is no guarantee that the snapshot will be consistent.  But given
that you said you waited for hours before taking the snapshot, it
seems likely that isn't the problem.

> Did we have such trouble during 2.6.32 time?

I don't recall anything like that.  But if you are still using 2.6.32,
then presumably you are on some enterprise distro.  I suggest you open
a support call with your enterprise distro provider, and/or with your
storage device (since I assume you're using some kind of storage array
that has a proprietary device driver).

Cheers,

					- Ted


