[Bug 43260] New: ftruncate locks up when used with direct IO on ext4


* [Bug 43260] New: ftruncate locks up when used with direct IO on ext4
@ 2012-05-17 23:31 bugzilla-daemon
  2012-05-18  2:06 ` [Bug 43260] " bugzilla-daemon
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: bugzilla-daemon @ 2012-05-17 23:31 UTC
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=43260

           Summary: ftruncate locks up when used with direct IO on ext4
           Product: File System
           Version: 2.5
    Kernel Version: all above 3.1-rc3
          Platform: All
        OS/Version: Linux
              Tree: Mainline
            Status: NEW
          Severity: high
          Priority: P1
         Component: ext4
        AssignedTo: fs_ext4@kernel-bugs.osdl.org
        ReportedBy: ivan@rethinkdb.com
                CC: tytso@mit.edu
        Regression: No

Created an attachment (id=73323)
 --> (https://bugzilla.kernel.org/attachment.cgi?id=73323)
Test program to reproduce the lock up

Calling ftruncate shortly after submitting a large number of direct IO requests
on a file on an ext4 filesystem causes the ftruncate syscall to lock up. Other
filesystems, e.g. xfs, do not exhibit this behavior.

This problem can be reproduced on all kernel versions above 3.1-rc3 (3.1-rc3
itself is fine), e.g. on v3.2.14 -- the kernel that is used in the latest
Ubuntu LTS release.

The attached program can be used to reproduce the problem. The problem can be
reproduced by running the program on a temporary ext4 filesystem inside UML
(user-mode Linux). Doing so is also advisable, since other syscalls accessing
the file system may lock up as well once the program has started.
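
The attached test program is not reproduced in this thread. The following is
only a minimal sketch of the access pattern described above, under several
assumptions: that the direct IO requests are submitted through Linux native AIO
(raw io_setup/io_submit syscalls), that the writes land in holes of a sparse
file, and that 64 requests of 4096 bytes are enough. The helper name
dio_then_truncate and the constants NREQ and BLK are hypothetical. On an
affected kernel the final ftruncate() may never return, so run it only inside
a disposable environment such as UML.

```c
#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <linux/aio_abi.h>      /* struct iocb, aio_context_t, IOCB_CMD_PWRITE */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

enum { NREQ = 64, BLK = 4096 }; /* hypothetical request count and block size */

/*
 * Hypothetical helper, NOT the attached test program.
 * Returns 0 if the final ftruncate() returned successfully,
 * -1 if any step failed (e.g. O_DIRECT or AIO unsupported).
 */
int dio_then_truncate(const char *path)
{
    struct iocb cbs[NREQ];
    struct iocb *ptrs[NREQ];
    aio_context_t ctx = 0;
    void *buf = NULL;
    int rc = -1;

    int fd = open(path, O_RDWR | O_CREAT | O_DIRECT, 0600);
    if (fd < 0)
        return -1;

    /* Direct IO needs a block-aligned buffer. */
    if (posix_memalign(&buf, BLK, BLK) != 0) {
        close(fd);
        return -1;
    }
    memset(buf, 'x', BLK);

    /* Assumption: extend the file first so the writes hit holes. */
    if (ftruncate(fd, (off_t)NREQ * BLK) != 0)
        goto out;
    if (syscall(SYS_io_setup, (long)NREQ, &ctx) != 0)
        goto out;

    memset(cbs, 0, sizeof(cbs));
    for (int i = 0; i < NREQ; i++) {
        cbs[i].aio_fildes = (uint32_t)fd;
        cbs[i].aio_lio_opcode = IOCB_CMD_PWRITE;
        cbs[i].aio_buf = (uintptr_t)buf;
        cbs[i].aio_nbytes = BLK;
        cbs[i].aio_offset = (int64_t)i * BLK;
        ptrs[i] = &cbs[i];
    }

    /* Queue the writes, then truncate while they may still be in flight. */
    if (syscall(SYS_io_submit, ctx, (long)NREQ, ptrs) > 0)
        rc = ftruncate(fd, 0);

    /* Cancels or waits for any requests that are still outstanding. */
    syscall(SYS_io_destroy, ctx);
out:
    free(buf);
    close(fd);
    return rc;
}
```

Calling dio_then_truncate() with a path on the ext4 filesystem under test
exercises the pattern once; a real reproducer would likely need more requests
or repeated runs to hit the race reliably.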

I was able to bisect the problem to this commit:
8c0bec2151a47906bf779c6715a10ce04453ab77.

If you plan to build the user-mode Linux kernel for this range of kernel
commits, you may need to apply the changes from commit
e5f0bdc7840bdb791247cb98dfc1dab6ea6c7da4, which fixes a build problem for
ARCH=um.

Keywords: ext4, ftruncate, direct IO, dio
Architecture: amd64 (but the bug is likely architecture-independent)

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.


* [Bug 43260] ftruncate locks up when used with direct IO on ext4
  2012-05-17 23:31 [Bug 43260] New: ftruncate locks up when used with direct IO on ext4 bugzilla-daemon
@ 2012-05-18  2:06 ` bugzilla-daemon
  2012-05-21 23:08 ` bugzilla-daemon
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: bugzilla-daemon @ 2012-05-18  2:06 UTC
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=43260


Eric Sandeen <sandeen@redhat.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |sandeen@redhat.com




--- Comment #1 from Eric Sandeen <sandeen@redhat.com>  2012-05-18 02:06:41 ---
commit details:

commit 8c0bec2151a47906bf779c6715a10ce04453ab77
Author: Jiaying Zhang <jiayingz@google.com>
Date:   Wed Aug 31 11:50:51 2011 -0400

    ext4: remove i_mutex lock in ext4_evict_inode to fix lockdep complaining

    The i_mutex lock and flush_completed_IO() added by commit 2581fdc810
    in ext4_evict_inode() causes lockdep complaining about potential
    deadlock in several places.  In most/all of these LOCKDEP complaints
    it looks like it's a false positive, since many of the potential
    circular locking cases can't take place by the time the
    ext4_evict_inode() is called; but since at the very least it may mask
    real problems, we need to address this.

    This change removes the flush_completed_IO() and i_mutex lock in
    ext4_evict_inode().  Instead, we take a different approach to resolve
    the software lockup that commit 2581fdc810 intends to fix.  Rather
    than having ext4-dio-unwritten thread wait for grabing the i_mutex
    lock of an inode, we use mutex_trylock() instead, and simply requeue
    the work item if we fail to grab the inode's i_mutex lock.

    This should speed up work queue processing in general and also
    prevents the following deadlock scenario: During page fault,
    shrink_icache_memory is called that in turn evicts another inode B.
    Inode B has some pending io_end work so it calls ext4_ioend_wait()
    that waits for inode B's i_ioend_count to become zero.  However, inode
    B's ioend work was queued behind some of inode A's ioend work on the
    same cpu's ext4-dio-unwritten workqueue.  As the ext4-dio-unwritten
    thread on that cpu is processing inode A's ioend work, it tries to
    grab inode A's i_mutex lock.  Since the i_mutex lock of inode A is
    still hold before the page fault happened, we enter a deadlock.

    Signed-off-by: Jiaying Zhang <jiayingz@google.com>
    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

I tested on the 3.2.10 Fedora kernel, though, and it didn't lock up (I tried
many times):

# rm -f lockfile; ./ftruncate-test lockfile
This might lock up..
It didn't lock up.


Can you lock it up, do a sysrq-w and maybe a sysrq-d, and attach the resulting
dmesg?


* [Bug 43260] ftruncate locks up when used with direct IO on ext4
  2012-05-17 23:31 [Bug 43260] New: ftruncate locks up when used with direct IO on ext4 bugzilla-daemon
  2012-05-18  2:06 ` [Bug 43260] " bugzilla-daemon
@ 2012-05-21 23:08 ` bugzilla-daemon
  2012-05-24 10:16 ` bugzilla-daemon
  2015-02-19 17:26 ` bugzilla-daemon
  3 siblings, 0 replies; 5+ messages in thread
From: bugzilla-daemon @ 2012-05-21 23:08 UTC
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=43260





--- Comment #2 from Ivan Tarasov <ivan@rethinkdb.com>  2012-05-21 23:08:04 ---
Created an attachment (id=73345)
 --> (https://bugzilla.kernel.org/attachment.cgi?id=73345)
Output of SysRq-w at lock up

Eric,

You are correct that 3.2.10 does not exhibit the problem (on both the
Red Hat-patched and vanilla kernel versions). That means that the bug was fixed
between 8c0bec21 and v3.2.10 and then reappeared between v3.2.10 and v3.2.14.

I repeated the bisect, this time between v3.2.10 and v3.2.14, and found that
this commit reintroduces the problem:

commit 8608fb78b2cbf9eb8794e592bf43a8b1884c5a85
Author: Jeff Moyer <jmoyer@redhat.com>
Date:   Mon Feb 20 17:59:24 2012 -0500

    ext4: fix race between unwritten extent conversion and truncate

    commit 266991b13890049ee1a6bb95b9817f06339ee3d7 upstream.

    The following comment in ext4_end_io_dio caught my attention:

        /* XXX: probably should move into the real I/O completion handler */
            inode_dio_done(inode);

    The truncate code takes i_mutex, then calls inode_dio_wait.  Because the
    ext4 code path above will end up dropping the mutex before it is
    reacquired by the worker thread that does the extent conversion, it
    seems to me that the truncate can happen out of order.  Jan Kara
    mentioned that this might result in error messages in the system logs,
    but that should be the extent of the "damage."

    The fix is pretty straight-forward: don't call inode_dio_done until the
    extent conversion is complete.

    Reviewed-by: Jan Kara <jack@suse.cz>
    Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

The output of sysrq-w during the lock up on this commit is attached.


* [Bug 43260] ftruncate locks up when used with direct IO on ext4
  2012-05-17 23:31 [Bug 43260] New: ftruncate locks up when used with direct IO on ext4 bugzilla-daemon
  2012-05-18  2:06 ` [Bug 43260] " bugzilla-daemon
  2012-05-21 23:08 ` bugzilla-daemon
@ 2012-05-24 10:16 ` bugzilla-daemon
  2015-02-19 17:26 ` bugzilla-daemon
  3 siblings, 0 replies; 5+ messages in thread
From: bugzilla-daemon @ 2012-05-24 10:16 UTC
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=43260

Jan Kara <jack@suse.cz> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jack@suse.cz

--- Comment #3 from Jan Kara <jack@suse.cz>  2012-05-24 10:16:57 ---
Hum, I see. So we are holding i_mutex and waiting for dio to finish, and the
worker thread cannot take i_mutex to finish the extent conversion and call
inode_dio_done(). The slightly subtle part is that the worker tries to be
clever: if it cannot acquire i_mutex, it requeues the work item, so it does not
really deadlock the system, it just eats up CPU cycling the work over and
over...
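
The interaction described in this comment can be illustrated with a small
single-threaded toy model. It is a sketch only, not kernel code: struct
toy_inode and the toy_* functions are hypothetical names, i_mutex is modelled
as a plain flag, and max_runs is an artificial bound on a loop that in the
reported scenario would spin indefinitely.

```c
#include <stdbool.h>

/* Toy stand-in for the inode state discussed above; not kernel code. */
struct toy_inode {
    bool i_mutex_held;  /* models i_mutex */
    int  i_dio_count;   /* outstanding direct IO requests */
};

/* Models mutex_trylock(): take the lock only if it is currently free. */
static bool toy_trylock(struct toy_inode *inode)
{
    if (inode->i_mutex_held)
        return false;
    inode->i_mutex_held = true;
    return true;
}

/*
 * One run of the extent-conversion work item.
 * Returns false when the lock is busy and the item has to be requeued.
 */
bool toy_end_io_work(struct toy_inode *inode)
{
    if (!toy_trylock(inode))
        return false;
    inode->i_dio_count--;           /* models inode_dio_done() after conversion */
    inode->i_mutex_held = false;
    return true;
}

/*
 * Models truncate: take i_mutex, then wait for direct IO to drain.
 * max_runs bounds the loop so the model terminates.
 * Returns how many times the work item was requeued.
 */
int toy_truncate(struct toy_inode *inode, int max_runs)
{
    int requeues = 0;

    inode->i_mutex_held = true;
    while (inode->i_dio_count > 0 && requeues < max_runs) {
        if (!toy_end_io_work(inode))
            requeues++;
    }
    inode->i_mutex_held = false;
    return requeues;
}
```

In this model, toy_truncate() on an inode with pending direct IO uses up all
max_runs attempts and leaves i_dio_count unchanged, because every run of the
work item finds the lock busy. Once the lock is released, a single
toy_end_io_work() call succeeds.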

I'm uncertain how best to fix this. If we just revert 266991b13, we reintroduce
the problem of AIO DIO completion racing with truncate (so extent conversion
would happen on already freed blocks). But I'm thinking that with DIO vs
truncate protection in VFS, we probably don't need i_mutex for extent
conversion: the only thing in ext4_end_io_nolock() that can possibly need
i_mutex is ext4_convert_unwritten_extents(). That function just starts a
transaction (i_mutex not needed) and calls ext4_map_blocks(), which takes
i_data_sem for protection.

I'll let this brew in my head for a while and if neither I nor anyone else
(hint, hint) finds a problem with this, I'll write a patch to remove the
lock...
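
The idea proposed above (extent conversion without i_mutex) can be sketched
with a single-threaded toy model. This is not the eventual patch, which does
not appear in this thread. The names struct toy_inode, toy_end_io_work_nolock
and toy_truncate are hypothetical, both locks are modelled as plain flags, and
the model assumes that truncate takes i_data_sem only after direct IO has
drained.

```c
#include <stdbool.h>

/* Toy stand-in for the inode state; not kernel code. */
struct toy_inode {
    bool i_mutex_held;     /* models i_mutex */
    bool i_data_sem_held;  /* models i_data_sem */
    int  i_dio_count;      /* outstanding direct IO requests */
};

/*
 * Extent-conversion work item that does not touch i_mutex.
 * Only i_data_sem is taken, modelling what ext4_map_blocks() does.
 * Returns false if i_data_sem is busy (real code would sleep instead).
 */
bool toy_end_io_work_nolock(struct toy_inode *inode)
{
    if (inode->i_data_sem_held)
        return false;
    inode->i_data_sem_held = true;
    inode->i_dio_count--;            /* models inode_dio_done() after conversion */
    inode->i_data_sem_held = false;
    return true;
}

/*
 * Models truncate: hold i_mutex while waiting for direct IO to drain.
 * max_runs bounds the wait; returns how many work runs failed.
 */
int toy_truncate(struct toy_inode *inode, int max_runs)
{
    int retries = 0;

    inode->i_mutex_held = true;
    while (inode->i_dio_count > 0 && retries < max_runs) {
        if (!toy_end_io_work_nolock(inode))
            retries++;
    }
    /* Assumption: i_data_sem is taken only after the wait has finished. */
    inode->i_data_sem_held = true;
    /* ... blocks would be freed here ... */
    inode->i_data_sem_held = false;
    inode->i_mutex_held = false;
    return retries;
}
```

In this model the wait finishes with zero retries: the work item never needs
i_mutex, so truncate holding it no longer prevents the direct IO count from
reaching zero.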


* [Bug 43260] ftruncate locks up when used with direct IO on ext4
  2012-05-17 23:31 [Bug 43260] New: ftruncate locks up when used with direct IO on ext4 bugzilla-daemon
                   ` (2 preceding siblings ...)
  2012-05-24 10:16 ` bugzilla-daemon
@ 2015-02-19 17:26 ` bugzilla-daemon
  3 siblings, 0 replies; 5+ messages in thread
From: bugzilla-daemon @ 2015-02-19 17:26 UTC
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=43260

Alan <alan@lxorguk.ukuu.org.uk> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
                 CC|                            |alan@lxorguk.ukuu.org.uk
         Resolution|---                         |OBSOLETE

--- Comment #4 from Alan <alan@lxorguk.ukuu.org.uk> ---
This bug relates to a very old kernel. Closing as obsolete.
