From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jan Kara Subject: Re: INFO: task hung in generic_file_write_iter Date: Tue, 15 Jan 2019 10:29:52 +0100 Message-ID: <20190115092952.GF29524@quack2.suse.cz> References: <20190102144015.GA23089@quack2.suse.cz> <275523c6-f750-44c2-a8a4-f3825eeab788@i-love.sakura.ne.jp> <20190102172636.GA29127@quack2.suse.cz> <20190108112425.GC8076@quack2.suse.cz> <20190109133006.GG15397@quack2.suse.cz> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Jan Kara , Tetsuo Handa , Andrew Morton , Alexander Viro , syzbot , Linux-MM , Mel Gorman , Michal Hocko , Andi Kleen , jlayton@redhat.com, LKML , syzkaller-bugs , tim.c.chen@linux.intel.com, linux-fsdevel To: Dmitry Vyukov Return-path: Content-Disposition: inline In-Reply-To: Sender: linux-kernel-owner@vger.kernel.org List-Id: linux-fsdevel.vger.kernel.org On Mon 14-01-19 16:13:08, Dmitry Vyukov wrote: > On Mon, Jan 14, 2019 at 4:11 PM Dmitry Vyukov wrote: > > > > On Wed, Jan 9, 2019 at 2:30 PM Jan Kara wrote: > > > > > > On Tue 08-01-19 12:49:08, Dmitry Vyukov wrote: > > > > On Tue, Jan 8, 2019 at 12:24 PM Jan Kara wrote: > > > > > > > > > > On Tue 08-01-19 19:04:06, Tetsuo Handa wrote: > > > > > > On 2019/01/03 2:26, Jan Kara wrote: > > > > > > > On Thu 03-01-19 01:07:25, Tetsuo Handa wrote: > > > > > > >> On 2019/01/02 23:40, Jan Kara wrote: > > > > > > >>> I had a look into this and the only good explanation for this I have is > > > > > > >>> that sb->s_blocksize is different from (1 << sb->s_bdev->bd_inode->i_blkbits). > > > > > > >>> If that would happen, we'd get exactly the behavior syzkaller observes > > > > > > >>> because grow_buffers() would populate different page than > > > > > > >>> __find_get_block() then looks up. > > > > > > >>> > > > > > > >>> However I don't see how that's possible since the filesystem has the block > > > > > > >>> device open exclusively and blkdev_bszset() makes sure we also have > > > > > > >>> exclusive access to the block device before changing the block device size. > > > > > > >>> So changing block device block size after filesystem gets access to the > > > > > > >>> device should be impossible. > > > > > > >>> > > > > > > >>> Anyway, could you perhaps add to your debug patch a dump of 'size' passed > > > > > > >>> to __getblk_slow() and bdev->bd_inode->i_blkbits? That should tell us > > > > > > >>> whether my theory is right or not. Thanks! > > > > > > >>> > > > > > > > > > > > > Got two reports. 'size' is 512 while bdev->bd_inode->i_blkbits is 12. > > > > > > > > > > > > https://syzkaller.appspot.com/text?tag=CrashLog&x=1237c3ab400000 > > > > > > > > > > > > [ 385.723941][ T439] kworker/u4:3(439): getblk(): executed=9 bh_count=0 bh_state=0 bdev_super_blocksize=512 size=512 bdev_super_blocksize_bits=9 bdev_inode_blkbits=12 > > > > > > (...snipped...) > > > > > > [ 568.159544][ T439] kworker/u4:3(439): getblk(): executed=9 bh_count=0 bh_state=0 bdev_super_blocksize=512 size=512 bdev_super_blocksize_bits=9 bdev_inode_blkbits=12 > > > > > > > > > > Right, so indeed the block size in the superblock and in the block device > > > > > gets out of sync which explains why we endlessly loop in the buffer cache > > > > > code. The superblock uses blocksize of 512 while the block device thinks > > > > > the set block size is 4096. > > > > > > > > > > And after staring into the code for some time, I finally have a trivial > > > > > reproducer: > > > > > > > > > > truncate -s 1G /tmp/image > > > > > losetup /dev/loop0 /tmp/image > > > > > mkfs.ext4 -b 1024 /dev/loop0 > > > > > mount -t ext4 /dev/loop0 /mnt > > > > > losetup -c /dev/loop0 > > > > > l /mnt > > > > > > > > > > > > > > > And the problem is that LOOP_SET_CAPACITY ioctl ends up reseting block > > > > > device block size to 4096 by calling bd_set_size(). I have to think how to > > > > > best fix this... > > > > > > > > > > Thanks for your help with debugging this! > > > > > > > > Wow! I am very excited. > > > > We have 587 open "task hung" reports, I suspect this explains lots of them. > > > > What would be some pattern that we can use to best-effort distinguish > > > > most manifestations? Skimming through few reports I see "inode_lock", > > > > "get_super", "blkdev_put" as common indicators. Anything else? > > > > > > Well, there will be always looping task with __getblk_gfp() on its stack > > > (which should be visible in the stacktrace generated by the stall > > > detector). Then there can be lots of other processes getting blocked due to > > > locks and other resources held by this task... > > > > > > Once we have a fix, I plan to do a sweep over existing open "task I have submitted the fix yesterday to linux-block ([PATCH 0/2] blkdev: Fix livelock when loop device updates capacity). > > hung" reports and dup lots of them onto this one. Probably preferring > > to over-sweep rather then to under-sweep because there are too many of > > them and lots does not seem to be actionable otherwise. > > Tetsuo, do you have comments before I start? > > Also, is it possible to add some kind of WARNING for this condition? > Taking into account how much effort it too to debug, looks like a > useful check. Or did I ask this already... There are two things we could do: 1) Warn if we loop in __getblk_slow() more than couple of times (looping once is normal, looping twice can happen easily due to races). 2) Warn & bail if block size passed to __getblk_slow() does not match the block device block size. I'll write the patch. Honza -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.5 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_PASS,USER_AGENT_MUTT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 910ABC43444 for ; Tue, 15 Jan 2019 09:30:02 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 59B682085A for ; Tue, 15 Jan 2019 09:30:02 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728647AbfAOJ34 (ORCPT ); Tue, 15 Jan 2019 04:29:56 -0500 Received: from mx2.suse.de ([195.135.220.15]:59334 "EHLO mx1.suse.de" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1728593AbfAOJ34 (ORCPT ); Tue, 15 Jan 2019 04:29:56 -0500 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay1.suse.de (unknown [195.135.220.254]) by mx1.suse.de (Postfix) with ESMTP id BE6D9ACBC; Tue, 15 Jan 2019 09:29:53 +0000 (UTC) Received: by quack2.suse.cz (Postfix, from userid 1000) id BE5C31E0620; Tue, 15 Jan 2019 10:29:52 +0100 (CET) Date: Tue, 15 Jan 2019 10:29:52 +0100 From: Jan Kara To: Dmitry Vyukov Cc: Jan Kara , Tetsuo Handa , Andrew Morton , Alexander Viro , syzbot , Linux-MM , Mel Gorman , Michal Hocko , Andi Kleen , jlayton@redhat.com, LKML , syzkaller-bugs , tim.c.chen@linux.intel.com, linux-fsdevel Subject: Re: INFO: task hung in generic_file_write_iter Message-ID: <20190115092952.GF29524@quack2.suse.cz> References: <20190102144015.GA23089@quack2.suse.cz> <275523c6-f750-44c2-a8a4-f3825eeab788@i-love.sakura.ne.jp> <20190102172636.GA29127@quack2.suse.cz> <20190108112425.GC8076@quack2.suse.cz> <20190109133006.GG15397@quack2.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.10.1 (2018-07-13) Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org Message-ID: <20190115092952.SDfMK3lxvdlSeHJawqaDOG2gufvx-VfOm4vi49AtCdg@z> On Mon 14-01-19 16:13:08, Dmitry Vyukov wrote: > On Mon, Jan 14, 2019 at 4:11 PM Dmitry Vyukov wrote: > > > > On Wed, Jan 9, 2019 at 2:30 PM Jan Kara wrote: > > > > > > On Tue 08-01-19 12:49:08, Dmitry Vyukov wrote: > > > > On Tue, Jan 8, 2019 at 12:24 PM Jan Kara wrote: > > > > > > > > > > On Tue 08-01-19 19:04:06, Tetsuo Handa wrote: > > > > > > On 2019/01/03 2:26, Jan Kara wrote: > > > > > > > On Thu 03-01-19 01:07:25, Tetsuo Handa wrote: > > > > > > >> On 2019/01/02 23:40, Jan Kara wrote: > > > > > > >>> I had a look into this and the only good explanation for this I have is > > > > > > >>> that sb->s_blocksize is different from (1 << sb->s_bdev->bd_inode->i_blkbits). > > > > > > >>> If that would happen, we'd get exactly the behavior syzkaller observes > > > > > > >>> because grow_buffers() would populate different page than > > > > > > >>> __find_get_block() then looks up. > > > > > > >>> > > > > > > >>> However I don't see how that's possible since the filesystem has the block > > > > > > >>> device open exclusively and blkdev_bszset() makes sure we also have > > > > > > >>> exclusive access to the block device before changing the block device size. > > > > > > >>> So changing block device block size after filesystem gets access to the > > > > > > >>> device should be impossible. > > > > > > >>> > > > > > > >>> Anyway, could you perhaps add to your debug patch a dump of 'size' passed > > > > > > >>> to __getblk_slow() and bdev->bd_inode->i_blkbits? That should tell us > > > > > > >>> whether my theory is right or not. Thanks! > > > > > > >>> > > > > > > > > > > > > Got two reports. 'size' is 512 while bdev->bd_inode->i_blkbits is 12. > > > > > > > > > > > > https://syzkaller.appspot.com/text?tag=CrashLog&x=1237c3ab400000 > > > > > > > > > > > > [ 385.723941][ T439] kworker/u4:3(439): getblk(): executed=9 bh_count=0 bh_state=0 bdev_super_blocksize=512 size=512 bdev_super_blocksize_bits=9 bdev_inode_blkbits=12 > > > > > > (...snipped...) > > > > > > [ 568.159544][ T439] kworker/u4:3(439): getblk(): executed=9 bh_count=0 bh_state=0 bdev_super_blocksize=512 size=512 bdev_super_blocksize_bits=9 bdev_inode_blkbits=12 > > > > > > > > > > Right, so indeed the block size in the superblock and in the block device > > > > > gets out of sync which explains why we endlessly loop in the buffer cache > > > > > code. The superblock uses blocksize of 512 while the block device thinks > > > > > the set block size is 4096. > > > > > > > > > > And after staring into the code for some time, I finally have a trivial > > > > > reproducer: > > > > > > > > > > truncate -s 1G /tmp/image > > > > > losetup /dev/loop0 /tmp/image > > > > > mkfs.ext4 -b 1024 /dev/loop0 > > > > > mount -t ext4 /dev/loop0 /mnt > > > > > losetup -c /dev/loop0 > > > > > l /mnt > > > > > > > > > > > > > > > And the problem is that LOOP_SET_CAPACITY ioctl ends up reseting block > > > > > device block size to 4096 by calling bd_set_size(). I have to think how to > > > > > best fix this... > > > > > > > > > > Thanks for your help with debugging this! > > > > > > > > Wow! I am very excited. > > > > We have 587 open "task hung" reports, I suspect this explains lots of them. > > > > What would be some pattern that we can use to best-effort distinguish > > > > most manifestations? Skimming through few reports I see "inode_lock", > > > > "get_super", "blkdev_put" as common indicators. Anything else? > > > > > > Well, there will be always looping task with __getblk_gfp() on its stack > > > (which should be visible in the stacktrace generated by the stall > > > detector). Then there can be lots of other processes getting blocked due to > > > locks and other resources held by this task... > > > > > > Once we have a fix, I plan to do a sweep over existing open "task I have submitted the fix yesterday to linux-block ([PATCH 0/2] blkdev: Fix livelock when loop device updates capacity). > > hung" reports and dup lots of them onto this one. Probably preferring > > to over-sweep rather then to under-sweep because there are too many of > > them and lots does not seem to be actionable otherwise. > > Tetsuo, do you have comments before I start? > > Also, is it possible to add some kind of WARNING for this condition? > Taking into account how much effort it too to debug, looks like a > useful check. Or did I ask this already... There are two things we could do: 1) Warn if we loop in __getblk_slow() more than couple of times (looping once is normal, looping twice can happen easily due to races). 2) Warn & bail if block size passed to __getblk_slow() does not match the block device block size. I'll write the patch. Honza -- Jan Kara SUSE Labs, CR