All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jan Kara <jack@suse.cz>
To: Nikolay Borisov <kernel@kyup.com>
Cc: Jan Kara <jack@suse.cz>, Hillf Danton <hillf.zj@alibaba-inc.com>,
	"'linux-kernel'" <linux-kernel@vger.kernel.org>,
	"Theodore Ts'o" <tytso@mit.edu>,
	Andreas Dilger <adilger.kernel@dilger.ca>,
	linux-fsdevel@vger.kernel.org,
	SiteGround Operations <operations@siteground.com>,
	vbabka@suse.cz, gilad@benyossef.com, mgorman@suse.de,
	linux-mm@kvack.org, Marian Marinov <mm@1h.com>
Subject: Re: [RFC PATCH 1/2] ext4: Fix possible deadlock with local interrupts disabled and page-draining IPI
Date: Tue, 13 Oct 2015 15:14:53 +0200	[thread overview]
Message-ID: <20151013131453.GA1332@quack.suse.cz> (raw)
In-Reply-To: <561CDEDC.30707@kyup.com>

[-- Attachment #1: Type: text/plain, Size: 9547 bytes --]

On Tue 13-10-15 13:37:16, Nikolay Borisov wrote:
> 
> 
> On 10/13/2015 11:15 AM, Jan Kara wrote:
> > On Mon 12-10-15 17:51:07, Nikolay Borisov wrote:
> >> Hello and thanks for the reply,
> >>
> >> On 10/12/2015 04:40 PM, Jan Kara wrote:
> >>> On Fri 09-10-15 11:03:30, Nikolay Borisov wrote:
> >>>> On 10/09/2015 10:37 AM, Hillf Danton wrote:
> >>>>>>>> @@ -109,8 +109,8 @@ static void ext4_finish_bio(struct bio *bio)
> >>>>>>>>  			if (bio->bi_error)
> >>>>>>>>  				buffer_io_error(bh);
> >>>>>>>>  		} while ((bh = bh->b_this_page) != head);
> >>>>>>>> -		bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
> >>>>>>>>  		local_irq_restore(flags);
> >>>>>>>
> >>>>>>> What if it takes 100ms to unlock after IRQ restored?
> >>>>>>
> >>>>>> I'm not sure I understand in what direction you are going? Care to
> >>>>>> elaborate?
> >>>>>>
> >>>>> Your change introduces extra time cost the lock waiter has to pay in
> >>>>> the case that irq happens before the lock is released.
> >>>>
> >>>> [CC filesystem and mm people. For reference the thread starts here:
> >>>>  http://thread.gmane.org/gmane.linux.kernel/2056996 ]
> >>>>
> >>>> Right, I see what you mean and it's a good point but when doing the
> >>>> patches I was striving for correctness and starting a discussion, hence
> >>>> the RFC. In any case I'd personally choose correctness over performance
> >>>> always ;).
> >>>>
> >>>> As I'm not an fs/ext4 expert and have added the relevant parties (please
> >>>> use reply-all from now on so that the thread is not being cut in the
> >>>> middle) who will be able to say whether it impact is going to be that
> >>>> big. I guess in this particular code path worrying about this is prudent
> >>>> as writeback sounds like a heavily used path.
> >>>>
> >>>> Maybe the problem should be approached from a different angle e.g.
> >>>> drain_all_pages and its reliance on the fact that the IPI will always be
> >>>> delivered in some finite amount of time? But what if a cpu with disabled
> >>>> interrupts is waiting on the task issuing the IPI?
> >>>
> >>> So I have looked through your patch and also original report (thread starts
> >>> here: https://lkml.org/lkml/2015/10/8/341) and IMHO one question hasn't
> >>> been properly answered yet: Who is holding BH_Uptodate_Lock we are spinning
> >>> on? You have suggested in https://lkml.org/lkml/2015/10/8/464 that it was
> >>> __block_write_full_page_endio() call but that cannot really be the case.
> >>> BH_Uptodate_Lock is used only in IO completion handlers -
> >>> end_buffer_async_read, end_buffer_async_write, ext4_finish_bio. So there
> >>> really should be some end_io function running on some other CPU which holds
> >>> BH_Uptodate_Lock for that buffer.
> >>
> >> I did check all the call traces of the current processes on the machine
> >> at the time of the hard lockup and none of the 3 functions you mentioned
> >> were in any of the call chains. But while I was looking the code of
> >> end_buffer_async_write and in the comments I saw it was mentioned that
> >> those completion handler were called from __block_write_full_page_endio
> >> so that's what pointed my attention to that function. But you are right
> >> that it doesn't take the BH lock.
> >>
> >> Furthermore the fact that the BH_Async_Write flag is set points me in
> >> the direction that end_buffer_async_write should have been executing but
> >> as I said issuing "bt" for all the tasks didn't show this function.
> > 
> > Actually ext4_bio_write_page() also sets BH_Async_Write so that seems like
> > a more likely place where that flag got set since ext4_finish_bio() was
> > then handling IO completion.
> > 
> >> I'm beginning to wonder if it's possible that a single bit memory error
> >> has crept up, but this still seems like a long shot...
> > 
> > Yup. Possible but a long shot. Is the problem reproducible in any way?
> 
> Okay, I rule out hardware issue since a different server today 
> experienced the same hard lockup. One thing which looks 
> suspicious to me are the repetitions of bio_endio/clone_endio: 
> 
> Oct 13 03:16:54 10.80.5.48 Call Trace:
> Oct 13 03:16:54 10.80.5.48 <NMI>
> Oct 13 03:16:54 10.80.5.48 [<ffffffff81651631>] dump_stack+0x58/0x7f
> Oct 13 03:16:54 10.80.5.48 [<ffffffff81089a6c>] warn_slowpath_common+0x8c/0xc0
> Oct 13 03:16:54 10.80.5.48 [<ffffffff81089b56>] warn_slowpath_fmt+0x46/0x50
> Oct 13 03:16:54 10.80.5.48 [<ffffffff811015f8>] watchdog_overflow_callback+0x98/0xc0
> Oct 13 03:16:54 10.80.5.48 [<ffffffff81132d0c>] __perf_event_overflow+0x9c/0x250
> Oct 13 03:16:54 10.80.5.48 [<ffffffff81133664>] perf_event_overflow+0x14/0x20
> Oct 13 03:16:54 10.80.5.48 [<ffffffff81061796>] intel_pmu_handle_irq+0x1d6/0x3e0
> Oct 13 03:16:54 10.80.5.48 [<ffffffff8105b4c4>] perf_event_nmi_handler+0x34/0x60
> Oct 13 03:16:54 10.80.5.48 [<ffffffff8104c152>] nmi_handle+0xa2/0x1a0
> Oct 13 03:16:54 10.80.5.48 [<ffffffff8104c3b4>] do_nmi+0x164/0x430
> Oct 13 03:16:54 10.80.5.48 [<ffffffff81656e2e>] end_repeat_nmi+0x1a/0x1e
> Oct 13 03:16:54 10.80.5.48 [<ffffffff8125be19>] ? ext4_finish_bio+0x279/0x2a0
> Oct 13 03:16:54 10.80.5.48 [<ffffffff8125be19>] ? ext4_finish_bio+0x279/0x2a0
> Oct 13 03:16:54 10.80.5.48 [<ffffffff8125be19>] ? ext4_finish_bio+0x279/0x2a0
> Oct 13 03:16:54 10.80.5.48 <<EOE>> 
> Oct 13 03:16:54 10.80.5.48 <IRQ> 
> Oct 13 03:16:54 10.80.5.48 [<ffffffff8125c2c8>] ext4_end_bio+0xc8/0x120
> Oct 13 03:16:54 10.80.5.48 [<ffffffff811dbf1d>] bio_endio+0x1d/0x40
> Oct 13 03:16:54 10.80.5.48 [<ffffffff81546781>] dec_pending+0x1c1/0x360
> Oct 13 03:16:54 10.80.5.48 [<ffffffff81546996>] clone_endio+0x76/0xa0
> Oct 13 03:16:54 10.80.5.48 [<ffffffff811dbf1d>] bio_endio+0x1d/0x40
> Oct 13 03:16:54 10.80.5.48 [<ffffffff81546781>] dec_pending+0x1c1/0x360
> Oct 13 03:16:54 10.80.5.48 [<ffffffff81546996>] clone_endio+0x76/0xa0
> Oct 13 03:16:54 10.80.5.48 [<ffffffff811dbf1d>] bio_endio+0x1d/0x40
> Oct 13 03:16:54 10.80.5.48 [<ffffffff81546781>] dec_pending+0x1c1/0x360
> Oct 13 03:16:54 10.80.5.48 [<ffffffff81546996>] clone_endio+0x76/0xa0
> Oct 13 03:16:54 10.80.5.48 [<ffffffff811dbf1d>] bio_endio+0x1d/0x40
> Oct 13 03:16:54 10.80.5.48 [<ffffffff812fad2b>] blk_update_request+0x21b/0x450
> Oct 13 03:16:54 10.80.5.48 [<ffffffff810e7797>] ? generic_exec_single+0xa7/0xb0
> Oct 13 03:16:54 10.80.5.48 [<ffffffff812faf87>] blk_update_bidi_request+0x27/0xb0
> Oct 13 03:16:54 10.80.5.48 [<ffffffff810e7817>] ? __smp_call_function_single+0x77/0x120
> Oct 13 03:16:54 10.80.5.48 [<ffffffff812fcc7f>] blk_end_bidi_request+0x2f/0x80
> Oct 13 03:16:54 10.80.5.48 [<ffffffff812fcd20>] blk_end_request+0x10/0x20
> Oct 13 03:16:54 10.80.5.48 [<ffffffff813fdc1c>] scsi_io_completion+0xbc/0x620
> Oct 13 03:16:54 10.80.5.48 [<ffffffff813f57f9>] scsi_finish_command+0xc9/0x130
> Oct 13 03:16:54 10.80.5.48 [<ffffffff813fe2e7>] scsi_softirq_done+0x147/0x170
> Oct 13 03:16:54 10.80.5.48 [<ffffffff813035ad>] blk_done_softirq+0x7d/0x90
> Oct 13 03:16:54 10.80.5.48 [<ffffffff8108ed87>] __do_softirq+0x137/0x2e0
> Oct 13 03:16:54 10.80.5.48 [<ffffffff81658a0c>] call_softirq+0x1c/0x30
> Oct 13 03:16:54 10.80.5.48 [<ffffffff8104a35d>] do_softirq+0x8d/0xc0
> Oct 13 03:16:54 10.80.5.48 [<ffffffff8108e925>] irq_exit+0x95/0xa0
> Oct 13 03:16:54 10.80.5.48 [<ffffffff81658f76>] do_IRQ+0x66/0xe0
> Oct 13 03:16:54 10.80.5.48 [<ffffffff816567ef>] common_interrupt+0x6f/0x6f
> Oct 13 03:16:54 10.80.5.48 <EOI> 
> Oct 13 03:16:54 10.80.5.48 [<ffffffff81656836>] ? retint_swapgs+0xe/0x13
> Oct 13 03:16:54 10.80.5.48 ---[ end trace 4a0584a583c66b92 ]---
> 
> Doing addr2line on ffffffff8125c2c8 shows:
> /home/projects/linux-stable/fs/ext4/page-io.c:335 which for me is the
> last bio_put in ext4_end_bio. However, the ? addresses, right at the
> beginning of the NMI stack (ffffffff8125be19) map to inner loop in
> bit_spin_lock:
> 
> } while (test_bit(bitnum, addr));
> 
> and this is in line with my initial bug report. 

OK.

> Unfortunately I wasn't able to acquire a crashdump since the machine
> hard-locked way too fast.
>
> On a slightly different note is it possible to
> panic the machine via NMIs? Since if all the CPUs are hard lockedup they
> cannot process sysrq interrupts?

Certainly it's possible to do that - the easiest way is actually to use

nmi_watchdog=panic

Then panic will automatically trigger when watchdog fires.

> >> Btw I think in any case the spin_lock patch is wrong as this code can be
> >> called from within softirq context and we do want to be interrupt safe
> >> at that point.
> > 
> > Agreed, that patch is definitely wrong.
> > 
> >>> BTW: I suppose the filesystem uses 4k blocksize, doesn't it?
> >>
> >> Unfortunately I cannot tell you with 100% certainty, since on this
> >> server there are multiple block devices with blocksize either 1k or 4k.
> >> So it is one of these. If you know a way to extract this information
> >> from a vmcore file I'd be happy to do it.
> > 
> > Well, if you have a crashdump, then bh->b_size is the block size. So just
> > check that for the bh we are spinning on.
> 
> Turns out in my original email the bh->b_size was shown : 
> b_size = 0x400 == 1k. So the filesystem is not 4k but 1k. 

OK, then I have a theory. We can manipulate bh->b_state in a non-atomic
manner in _ext4_get_block(). If we happen to do that on the first buffer in
a page while IO completes on another buffer in the same page, we could in
theory mess up and miss clearing of BH_Uptodate_Lock flag. Can you try
whether the attached patch fixes your problem?

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

[-- Attachment #2: 0001-ext4-Fix-bh-b_state-corruption.patch --]
[-- Type: text/x-patch, Size: 2911 bytes --]

>From 4437fcab09fdbac2136f4fbd1dd7530ac0ec5b3a Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.com>
Date: Tue, 13 Oct 2015 14:59:54 +0200
Subject: [PATCH] ext4: Fix bh->b_state corruption

ext4 can update bh->b_state non-atomically in _ext4_get_block() and
ext4_da_get_block_prep(). Usually this is fine since bh is just an
temporary storage for mapping information on stack but in some cases it
can be fully living bh attached to a page. In such case non-atomic
update of bh->b_state can race with an atomic update which then gets
lost. Usually when we are mapping bh and thus updating bh->b_state
non-atomically, nobody else touches the bh and so things work out fine
but there is one case to especially worry about: ext4_finish_bio() uses
BH_Uptodate_Lock on the first bh in the page to synchronize handling of
PageWriteback state. So when blocksize < pagesize, we can be atomically
modifying bh->b_state of a buffer that actually isn't under IO and thus
can race e.g. with delalloc trying to map that buffer. The result is
that we can mistakenly set / clear BH_Uptodate_Lock bit resulting in the
corruption of PageWriteback state or missed unlock of BH_Uptodate_Lock.

Fix the problem by always updating bh->b_state bits atomically.

Signed-off-by: Jan Kara <jack@suse.com>
---
 fs/ext4/inode.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 612fbcf76b5c..77604002ae75 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -657,6 +657,24 @@ has_zeroout:
 	return retval;
 }
 
+/* 
+ * Update EXT4_MAP_FLAGS in bh->b_state atomically according to 'flags'. This
+ * is ugly but once we get rid of using bh as a container for mapping
+ * information to pass to / from get_block functions, this can go away.
+ */
+static void ext4_update_bh_state(struct buffer_head *bh, unsigned long flags)
+{
+	int i;
+
+	for (i = 0; i < BITS_PER_LONG; i++)
+		if ((1 << i) & EXT4_MAP_FLAGS) {
+			if (flags & (1 << i))
+				set_bit(i, &bh->b_state);
+			else
+				clear_bit(i, &bh->b_state);
+		}
+}
+
 /* Maximum number of blocks we map for direct IO at once. */
 #define DIO_MAX_BLOCKS 4096
 
@@ -693,7 +711,7 @@ static int _ext4_get_block(struct inode *inode, sector_t iblock,
 		ext4_io_end_t *io_end = ext4_inode_aio(inode);
 
 		map_bh(bh, inode->i_sb, map.m_pblk);
-		bh->b_state = (bh->b_state & ~EXT4_MAP_FLAGS) | map.m_flags;
+		ext4_update_bh_state(bh, map.m_flags);
 		if (IS_DAX(inode) && buffer_unwritten(bh)) {
 			/*
 			 * dgc: I suspect unwritten conversion on ext4+DAX is
@@ -1637,7 +1655,7 @@ int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
 		return ret;
 
 	map_bh(bh, inode->i_sb, map.m_pblk);
-	bh->b_state = (bh->b_state & ~EXT4_MAP_FLAGS) | map.m_flags;
+	ext4_update_bh_state(bh, map.m_flags);
 
 	if (buffer_unwritten(bh)) {
 		/* A delayed write to unwritten bh should be marked
-- 
2.1.4


WARNING: multiple messages have this Message-ID (diff)
From: Jan Kara <jack@suse.cz>
To: Nikolay Borisov <kernel@kyup.com>
Cc: Jan Kara <jack@suse.cz>, Hillf Danton <hillf.zj@alibaba-inc.com>,
	'linux-kernel' <linux-kernel@vger.kernel.org>,
	Theodore Ts'o <tytso@mit.edu>,
	Andreas Dilger <adilger.kernel@dilger.ca>,
	linux-fsdevel@vger.kernel.org,
	SiteGround Operations <operations@siteground.com>,
	vbabka@suse.cz, gilad@benyossef.com, mgorman@suse.de,
	linux-mm@kvack.org, Marian Marinov <mm@1h.com>
Subject: Re: [RFC PATCH 1/2] ext4: Fix possible deadlock with local interrupts disabled and page-draining IPI
Date: Tue, 13 Oct 2015 15:14:53 +0200	[thread overview]
Message-ID: <20151013131453.GA1332@quack.suse.cz> (raw)
In-Reply-To: <561CDEDC.30707@kyup.com>

[-- Attachment #1: Type: text/plain, Size: 9547 bytes --]

On Tue 13-10-15 13:37:16, Nikolay Borisov wrote:
> 
> 
> On 10/13/2015 11:15 AM, Jan Kara wrote:
> > On Mon 12-10-15 17:51:07, Nikolay Borisov wrote:
> >> Hello and thanks for the reply,
> >>
> >> On 10/12/2015 04:40 PM, Jan Kara wrote:
> >>> On Fri 09-10-15 11:03:30, Nikolay Borisov wrote:
> >>>> On 10/09/2015 10:37 AM, Hillf Danton wrote:
> >>>>>>>> @@ -109,8 +109,8 @@ static void ext4_finish_bio(struct bio *bio)
> >>>>>>>>  			if (bio->bi_error)
> >>>>>>>>  				buffer_io_error(bh);
> >>>>>>>>  		} while ((bh = bh->b_this_page) != head);
> >>>>>>>> -		bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
> >>>>>>>>  		local_irq_restore(flags);
> >>>>>>>
> >>>>>>> What if it takes 100ms to unlock after IRQ restored?
> >>>>>>
> >>>>>> I'm not sure I understand in what direction you are going? Care to
> >>>>>> elaborate?
> >>>>>>
> >>>>> Your change introduces extra time cost the lock waiter has to pay in
> >>>>> the case that irq happens before the lock is released.
> >>>>
> >>>> [CC filesystem and mm people. For reference the thread starts here:
> >>>>  http://thread.gmane.org/gmane.linux.kernel/2056996 ]
> >>>>
> >>>> Right, I see what you mean and it's a good point but when doing the
> >>>> patches I was striving for correctness and starting a discussion, hence
> >>>> the RFC. In any case I'd personally choose correctness over performance
> >>>> always ;).
> >>>>
> >>>> As I'm not an fs/ext4 expert and have added the relevant parties (please
> >>>> use reply-all from now on so that the thread is not being cut in the
> >>>> middle) who will be able to say whether it impact is going to be that
> >>>> big. I guess in this particular code path worrying about this is prudent
> >>>> as writeback sounds like a heavily used path.
> >>>>
> >>>> Maybe the problem should be approached from a different angle e.g.
> >>>> drain_all_pages and its reliance on the fact that the IPI will always be
> >>>> delivered in some finite amount of time? But what if a cpu with disabled
> >>>> interrupts is waiting on the task issuing the IPI?
> >>>
> >>> So I have looked through your patch and also original report (thread starts
> >>> here: https://lkml.org/lkml/2015/10/8/341) and IMHO one question hasn't
> >>> been properly answered yet: Who is holding BH_Uptodate_Lock we are spinning
> >>> on? You have suggested in https://lkml.org/lkml/2015/10/8/464 that it was
> >>> __block_write_full_page_endio() call but that cannot really be the case.
> >>> BH_Uptodate_Lock is used only in IO completion handlers -
> >>> end_buffer_async_read, end_buffer_async_write, ext4_finish_bio. So there
> >>> really should be some end_io function running on some other CPU which holds
> >>> BH_Uptodate_Lock for that buffer.
> >>
> >> I did check all the call traces of the current processes on the machine
> >> at the time of the hard lockup and none of the 3 functions you mentioned
> >> were in any of the call chains. But while I was looking the code of
> >> end_buffer_async_write and in the comments I saw it was mentioned that
> >> those completion handler were called from __block_write_full_page_endio
> >> so that's what pointed my attention to that function. But you are right
> >> that it doesn't take the BH lock.
> >>
> >> Furthermore the fact that the BH_Async_Write flag is set points me in
> >> the direction that end_buffer_async_write should have been executing but
> >> as I said issuing "bt" for all the tasks didn't show this function.
> > 
> > Actually ext4_bio_write_page() also sets BH_Async_Write so that seems like
> > a more likely place where that flag got set since ext4_finish_bio() was
> > then handling IO completion.
> > 
> >> I'm beginning to wonder if it's possible that a single bit memory error
> >> has crept up, but this still seems like a long shot...
> > 
> > Yup. Possible but a long shot. Is the problem reproducible in any way?
> 
> Okay, I rule out hardware issue since a different server today 
> experienced the same hard lockup. One thing which looks 
> suspicious to me are the repetitions of bio_endio/clone_endio: 
> 
> Oct 13 03:16:54 10.80.5.48 Call Trace:
> Oct 13 03:16:54 10.80.5.48 <NMI>
> Oct 13 03:16:54 10.80.5.48 [<ffffffff81651631>] dump_stack+0x58/0x7f
> Oct 13 03:16:54 10.80.5.48 [<ffffffff81089a6c>] warn_slowpath_common+0x8c/0xc0
> Oct 13 03:16:54 10.80.5.48 [<ffffffff81089b56>] warn_slowpath_fmt+0x46/0x50
> Oct 13 03:16:54 10.80.5.48 [<ffffffff811015f8>] watchdog_overflow_callback+0x98/0xc0
> Oct 13 03:16:54 10.80.5.48 [<ffffffff81132d0c>] __perf_event_overflow+0x9c/0x250
> Oct 13 03:16:54 10.80.5.48 [<ffffffff81133664>] perf_event_overflow+0x14/0x20
> Oct 13 03:16:54 10.80.5.48 [<ffffffff81061796>] intel_pmu_handle_irq+0x1d6/0x3e0
> Oct 13 03:16:54 10.80.5.48 [<ffffffff8105b4c4>] perf_event_nmi_handler+0x34/0x60
> Oct 13 03:16:54 10.80.5.48 [<ffffffff8104c152>] nmi_handle+0xa2/0x1a0
> Oct 13 03:16:54 10.80.5.48 [<ffffffff8104c3b4>] do_nmi+0x164/0x430
> Oct 13 03:16:54 10.80.5.48 [<ffffffff81656e2e>] end_repeat_nmi+0x1a/0x1e
> Oct 13 03:16:54 10.80.5.48 [<ffffffff8125be19>] ? ext4_finish_bio+0x279/0x2a0
> Oct 13 03:16:54 10.80.5.48 [<ffffffff8125be19>] ? ext4_finish_bio+0x279/0x2a0
> Oct 13 03:16:54 10.80.5.48 [<ffffffff8125be19>] ? ext4_finish_bio+0x279/0x2a0
> Oct 13 03:16:54 10.80.5.48 <<EOE>> 
> Oct 13 03:16:54 10.80.5.48 <IRQ> 
> Oct 13 03:16:54 10.80.5.48 [<ffffffff8125c2c8>] ext4_end_bio+0xc8/0x120
> Oct 13 03:16:54 10.80.5.48 [<ffffffff811dbf1d>] bio_endio+0x1d/0x40
> Oct 13 03:16:54 10.80.5.48 [<ffffffff81546781>] dec_pending+0x1c1/0x360
> Oct 13 03:16:54 10.80.5.48 [<ffffffff81546996>] clone_endio+0x76/0xa0
> Oct 13 03:16:54 10.80.5.48 [<ffffffff811dbf1d>] bio_endio+0x1d/0x40
> Oct 13 03:16:54 10.80.5.48 [<ffffffff81546781>] dec_pending+0x1c1/0x360
> Oct 13 03:16:54 10.80.5.48 [<ffffffff81546996>] clone_endio+0x76/0xa0
> Oct 13 03:16:54 10.80.5.48 [<ffffffff811dbf1d>] bio_endio+0x1d/0x40
> Oct 13 03:16:54 10.80.5.48 [<ffffffff81546781>] dec_pending+0x1c1/0x360
> Oct 13 03:16:54 10.80.5.48 [<ffffffff81546996>] clone_endio+0x76/0xa0
> Oct 13 03:16:54 10.80.5.48 [<ffffffff811dbf1d>] bio_endio+0x1d/0x40
> Oct 13 03:16:54 10.80.5.48 [<ffffffff812fad2b>] blk_update_request+0x21b/0x450
> Oct 13 03:16:54 10.80.5.48 [<ffffffff810e7797>] ? generic_exec_single+0xa7/0xb0
> Oct 13 03:16:54 10.80.5.48 [<ffffffff812faf87>] blk_update_bidi_request+0x27/0xb0
> Oct 13 03:16:54 10.80.5.48 [<ffffffff810e7817>] ? __smp_call_function_single+0x77/0x120
> Oct 13 03:16:54 10.80.5.48 [<ffffffff812fcc7f>] blk_end_bidi_request+0x2f/0x80
> Oct 13 03:16:54 10.80.5.48 [<ffffffff812fcd20>] blk_end_request+0x10/0x20
> Oct 13 03:16:54 10.80.5.48 [<ffffffff813fdc1c>] scsi_io_completion+0xbc/0x620
> Oct 13 03:16:54 10.80.5.48 [<ffffffff813f57f9>] scsi_finish_command+0xc9/0x130
> Oct 13 03:16:54 10.80.5.48 [<ffffffff813fe2e7>] scsi_softirq_done+0x147/0x170
> Oct 13 03:16:54 10.80.5.48 [<ffffffff813035ad>] blk_done_softirq+0x7d/0x90
> Oct 13 03:16:54 10.80.5.48 [<ffffffff8108ed87>] __do_softirq+0x137/0x2e0
> Oct 13 03:16:54 10.80.5.48 [<ffffffff81658a0c>] call_softirq+0x1c/0x30
> Oct 13 03:16:54 10.80.5.48 [<ffffffff8104a35d>] do_softirq+0x8d/0xc0
> Oct 13 03:16:54 10.80.5.48 [<ffffffff8108e925>] irq_exit+0x95/0xa0
> Oct 13 03:16:54 10.80.5.48 [<ffffffff81658f76>] do_IRQ+0x66/0xe0
> Oct 13 03:16:54 10.80.5.48 [<ffffffff816567ef>] common_interrupt+0x6f/0x6f
> Oct 13 03:16:54 10.80.5.48 <EOI> 
> Oct 13 03:16:54 10.80.5.48 [<ffffffff81656836>] ? retint_swapgs+0xe/0x13
> Oct 13 03:16:54 10.80.5.48 ---[ end trace 4a0584a583c66b92 ]---
> 
> Doing addr2line on ffffffff8125c2c8 shows:
> /home/projects/linux-stable/fs/ext4/page-io.c:335 which for me is the
> last bio_put in ext4_end_bio. However, the ? addresses, right at the
> beginning of the NMI stack (ffffffff8125be19) map to inner loop in
> bit_spin_lock:
> 
> } while (test_bit(bitnum, addr));
> 
> and this is in line with my initial bug report. 

OK.

> Unfortunately I wasn't able to acquire a crashdump since the machine
> hard-locked way too fast.
>
> On a slightly different note is it possible to
> panic the machine via NMIs? Since if all the CPUs are hard lockedup they
> cannot process sysrq interrupts?

Certainly it's possible to do that - the easiest way is actually to use

nmi_watchdog=panic

Then panic will automatically trigger when watchdog fires.

> >> Btw I think in any case the spin_lock patch is wrong as this code can be
> >> called from within softirq context and we do want to be interrupt safe
> >> at that point.
> > 
> > Agreed, that patch is definitely wrong.
> > 
> >>> BTW: I suppose the filesystem uses 4k blocksize, doesn't it?
> >>
> >> Unfortunately I cannot tell you with 100% certainty, since on this
> >> server there are multiple block devices with blocksize either 1k or 4k.
> >> So it is one of these. If you know a way to extract this information
> >> from a vmcore file I'd be happy to do it.
> > 
> > Well, if you have a crashdump, then bh->b_size is the block size. So just
> > check that for the bh we are spinning on.
> 
> Turns out in my original email the bh->b_size was shown : 
> b_size = 0x400 == 1k. So the filesystem is not 4k but 1k. 

OK, then I have a theory. We can manipulate bh->b_state in a non-atomic
manner in _ext4_get_block(). If we happen to do that on the first buffer in
a page while IO completes on another buffer in the same page, we could in
theory mess up and miss clearing of BH_Uptodate_Lock flag. Can you try
whether the attached patch fixes your problem?

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

[-- Attachment #2: 0001-ext4-Fix-bh-b_state-corruption.patch --]
[-- Type: text/x-patch, Size: 0 bytes --]



  reply	other threads:[~2015-10-13 13:15 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-10-09  7:19 [RFC PATCH 1/2] ext4: Fix possible deadlock with local interrupts disabled and page-draining IPI Hillf Danton
2015-10-09  7:26 ` Nikolay Borisov
2015-10-09  7:37   ` Hillf Danton
2015-10-09  8:03     ` Nikolay Borisov
2015-10-09  8:03       ` Nikolay Borisov
2015-10-12 13:40       ` Jan Kara
2015-10-12 13:40         ` Jan Kara
2015-10-12 14:51         ` Nikolay Borisov
2015-10-12 14:51           ` Nikolay Borisov
2015-10-13  8:15           ` Jan Kara
2015-10-13  8:15             ` Jan Kara
2015-10-13 10:37             ` Nikolay Borisov
2015-10-13 10:37               ` Nikolay Borisov
2015-10-13 13:14               ` Jan Kara [this message]
2015-10-13 13:14                 ` Jan Kara
2015-10-14  9:02                 ` Nikolay Borisov
2015-10-14  9:02                   ` Nikolay Borisov
2015-10-16  8:08                 ` Nikolay Borisov
2015-10-16 12:51                   ` Jan Kara
  -- strict thread matches above, loose matches on Subject: below --
2015-10-08 15:31 Nikolay Borisov
     [not found] ` <CAOtvUMcrhq3epOPCEciMGq53S6rTyURAKEWhQ=NwrkF95aJ+_Q@mail.gmail.com>
2015-10-09  8:50   ` Nikolay Borisov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20151013131453.GA1332@quack.suse.cz \
    --to=jack@suse.cz \
    --cc=adilger.kernel@dilger.ca \
    --cc=gilad@benyossef.com \
    --cc=hillf.zj@alibaba-inc.com \
    --cc=kernel@kyup.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@suse.de \
    --cc=mm@1h.com \
    --cc=operations@siteground.com \
    --cc=tytso@mit.edu \
    --cc=vbabka@suse.cz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.