From mboxrd@z Thu Jan 1 00:00:00 1970
From: Theodore Ts'o
Subject: Re: RT/ext4/jbd2 circular dependency
Date: Thu, 30 Oct 2014 19:24:37 -0400
Message-ID: <20141030232437.GF31927@thunk.org>
References: <54415991.1070907@pavlinux.ru> <544940EF.7090907@windriver.com>
	<544E7144.4080809@windriver.com> <54513BDA.1050804@windriver.com>
	<20141029231916.GD5000@thunk.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Chris Friesen, Austin Schuh, pavel@pavlinux.ru, "J. Bruce Fields",
	linux-ext4@vger.kernel.org, adilger.kernel@dilger.ca, rt-users
To: Thomas Gleixner
Return-path:
Received: from imap.thunk.org ([74.207.234.97]:54116 "EHLO imap.thunk.org"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1161211AbaJ3XYw (ORCPT ); Thu, 30 Oct 2014 19:24:52 -0400
Content-Disposition: inline
In-Reply-To:
Sender: linux-rt-users-owner@vger.kernel.org
List-ID:

On Thu, Oct 30, 2014 at 10:11:26PM +0100, Thomas Gleixner wrote:
>
> That's a way better explanation than what I saw in the commit logs and
> it actually maps to the observed traces and stackdumps.

I can't speak for Jan, but I suspect he didn't realize that there was
a problem.  The commit description in b34090e5e2 makes it clear that
the intent was a performance improvement, and not an attempt to fix a
potential deadlock bug.

Looking at the commit history, the problem was introduced in 2.6.27
(July 2008), in commit c851ed54017373, so this problem wasn't noticed
in the RHEL 6 and RHEL 7 enterprise linux QA runs, and it wasn't
noticed in all of the regression testing that we've been doing.

I've certainly seen this before.  Two years ago we found a bug that
was only noticed when we deployed ext4 in production at Google, and
stress tested it at Google scale with the appropriate monitoring
systems so we could find a bug that had existed since the very
beginning of ext3, and which had never been noticed in all of the
enterprise testing done by Red Hat, SuSE, IBM, HP, etc.
Actually, it probably was noticed, but never in a reproducible way,
and so it was probably written off as some kind of flaky
hardware-induced corruption.

The difference is that in this case, it seems that Chris and Kevin
were able to reproduce the problem reliably.  (It also might be that
the RT patch kit widens the race window and makes it much more likely
to trigger.)  Chris or Kevin, if you have time to try to create a
reliable repro that is small/simple enough that we could propose it as
a new test to add to xfstests, that would be great.  If you can't,
that's completely understandable.

In the case I described above, it was an extremely hard to hit race
that only happened under high memory pressure, so we were never able
to create a reliable repro.  Instead we had a theory that was
consistent with the pattern of metadata corruption we were seeing,
deployed a kernel with the fix, and after a few weeks were able to
conclude we had finally fixed the bug.  Welcome to file system
debugging.  :-)

> Thanks for the clarification! I'm just getting nervous when 'picked
> some backports' magically 'fixes' an issue without a proper
> explanation.

Well, thanks to Chris for pointing out that b34090e5 seemed to make
the problem go away.  Once I looked at what that patch changed, it was
a lot more obvious what might have been going wrong.  It's always
helpful if you can peek at the answer key, even if it's only a
potential answer key.  :-)

Cheers,

					- Ted