From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754757AbbKWSsZ (ORCPT ); Mon, 23 Nov 2015 13:48:25 -0500
Received: from mx2.suse.de ([195.135.220.15]:56369 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752459AbbKWSsX (ORCPT ); Mon, 23 Nov 2015 13:48:23 -0500
Date: Mon, 23 Nov 2015 19:48:20 +0100
From: "Luis R. Rodriguez"
To: Juergen Gross
Cc: Vassilis Virvilis, linux-kernel@vger.kernel.org, Toshi Kani, mcgrof@suse.com, mcgrof@do-not-panic.com
Subject: Re: Hibernate resume bug around 3,18-rc2 - Full PAT support
Message-ID: <20151123184820.GY9528@wotan.suse.de>
References: <564CF10E.6000905@iit.demokritos.gr> <564D6090.9020604@suse.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <564D6090.9020604@suse.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Nov 19, 2015 at 06:39:28AM +0100, Juergen Gross wrote:
> On 18/11/15 22:43, Vassilis Virvilis wrote:
> > Hi,
> >
> > I have been hit by a hibernate/resume bug. Other people may have too:
> > The following links are consistent with my observations
> >
> > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1490494
> > https://bugs.archlinux.org/task/44807
> >
> > Some observations:
> > 1) The first few rapid hibernation / resume cycles do not fail.
> >
> > 2) If the computer is loaded (eclipse + chromium + firefox/iceweasel +
> > thunderbird/icedove + Konsole) helps to reproduce and lock up during resume

Let's try to speed up reproducing this. I have a hunch this might be
related to some BIOS-controlled MTRRs and a mismatch that leads the
kernel to think a type of MTRR write is OK when in fact it's not.
Given the workload you describe, this could be related to fan control,
the BIOS's control over the fans, and a conflict with some other
device's MTRR. More on this suspicion on another thread where you
provide more logs.

On a kernel that you know fails, can you try replacing this workload
with something that brings your CPU to its knees quickly? Perhaps run
'make -j' on a Linux build for 2, 4, 8 or 16 minutes, then hit CTRL-C
and go on to hibernate, to see whether triggering the CPU fan
accelerates the issue. If 'make -j' is too nuts, to the point you can't
even CTRL-C it, try 'make -j 16'.

Note that if this is true then that means a hot CPU could still trigger
CPU fan controls on a fresh boot if the previous boot was CPU intensive.

If this doesn't do it, let's try forcing the use of an MTRR-capable
driver. Graphics is the obvious target, so try some 3D stuff or a
screen saver prior to hibernation.

Note that even if you boot nomtrr the BIOS may still use MTRRs, and PAT
use on Linux could assume MTRR is not being used on drivers while the
BIOS still does something behind the scenes. This is actually one reason
why we can't exactly remove MTRR support from Linux: the BIOS may still
do some wacky stuff with MTRRs. One example I was given is that CPU fan
control might use WC MTRRs, so the kernel must be aware of this even if
no MTRRs are ever used by the Linux kernel at all -- this is the case
now as of v4.3 and onwards.

If that doesn't help speed it up, maybe try all of them together: screen
saver + some 3D stuff + CPU-intensive stuff.

To help you speed up testing you can try reducing your build time by
reducing the amount of crap you have to build:

  make localmodconfig

That should build only the modules your running kernel currently has
loaded, plus whatever is already built in (=y).

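The "build for N minutes, then CTRL-C" step can be scripted so every run
loads the CPU for the same length of time. This is only a sketch: it
assumes GNU coreutils timeout(1) is installed, and timed_load is a
made-up helper name, not something from this thread.

```shell
#!/bin/sh
# Hypothetical helper: run a load command for a fixed time, then
# interrupt it the way CTRL-C would.
# Usage: timed_load DURATION COMMAND [ARGS...]
timed_load() {
    duration=$1
    shift
    # GNU timeout sends SIGINT once DURATION has elapsed and then exits
    # with status 124; if the command finishes earlier, its own exit
    # status is returned instead.
    timeout --signal=INT "$duration" "$@"
}

# Example, from the top of a kernel tree (hibernate by hand afterwards):
#   timed_load 2m make -j16
# Repeat with 4m, 8m and 16m on later attempts.
```

An exit status of 124 means the load ran for the full duration, so that
attempt is comparable with the others.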
> > 3) Long hibernation times (overnight) helps to reproduce and lock up
> > during resume
> >
> > 4) For the bad commits (where the lockup during resume takes place) -
> > the image loading during resume is significantly faster. It is fast and
> > then it locks.
> >
> > How I hit the problem and what I have done:
> >
> > I am running debian unstable
> >
> > Debian went from 3.16 to 3.19 - hence the problem raised its ugly head.
> > I upgraded diligently up to 4.2.6 - The problem persists
> >
> > I started kernel bisection from 3.16 to 3.19 following
> > https://wiki.debian.org/DebianKernel/GitBisect
> >
> > One month and 25 kernels later see below for the bisect log
>
> Wow! Thanks for doing this work!
>

Vassilis, indeed, the amount of work you have put into this is extremely
appreciated!

> Juergen
> >
> > I hit some untestable kernel that weren't booting. They were hanging at
> > "Loading ramdisk..." before any actual kernel message.
> >
> > Looks like the first bad / untestable commit is from Juergen Gross /
> > Thomas Gleixner Merge branch 'x86-mm-for-linus' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip [full PAT support]
> >

That is commit a023748d53c10850650fe86b1c4a7d421d576451 ("Merge branch
'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip").

Git is smart enough to tell you that you've hit a merge commit and that
any of the commits brought in by that merge could be the issue. This is
why your bisect log shows a slew of commits. The next step is to bisect
through the commits inside that merge, which will let us identify the
exact commit that may have caused the issue. There are a few ways to do
this; my preferred way is to "unfold" a merge commit manually.

To keep things separate (without affecting other tests you might have on
your other git tree, and so that you don't lose fresh object files as
you continue to build-test on the other tree), I'd do something like
this, with ~/linux being wherever your existing tree lives:

  mkdir ~/tmp
  cd ~/tmp
  git clone ~/linux/.git linux-dev-test
  cd linux-dev-test

Notice how if you do git log and search for
a023748d53c10850650fe86b1c4a7d421d576451 you'll see that the commit
listed before this is 773fed910d41e443e495a6bfa9ab1c2b7b13e012 ("Merge
branches 'x86-platform-for-linus' and 'x86-uv-for-linus' of
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip").

To be clear, the list of commits you typically would see is just:

  a023748d53c10850650fe86b1c4a7d421d576451 - Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
  773fed910d41e443e495a6bfa9ab1c2b7b13e012 - Merge branches 'x86-platform-for-linus' and 'x86-uv-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

We want to go down into the commits in the merge commit a023748d53c and
then zero in on exactly which commit caused the issue. To do that, in
your linux-dev-test directory you can do this:

  git checkout -b test-merge-commit a023748d53c10850650fe86b1c4a7d421d576451

That will create a branch for testing based on the merge commit. Now do
this:

  git rebase -i 773fed910d41e443e495a6bfa9ab1c2b7b13e012

Then don't change anything in the list (leave every line as "pick"),
just save and exit the editor, and git will actually "unfold" the merge
commit for you -- it magically will apply each commit in that merge
commit linearly into your git history.

For instance, the rebase should show 22 commits as follows. Just leave
the defaults as shown below, then save and quit (ESC + :wq if in vi):

  pick 96e70f832856 x86/mm: Avoid overlap the fixmap area on i386
  pick 63e7b6d90c1e x86: mm: Re-use the early_ioremap fixed area
  pick bdee237c0343 x86: mm: Use 2GB memory block size on large-memory x86-64 systems
  pick 281d4078bec3 x86: Make page cache mode a real type
  pick c27ce0af896b x86: Use new cache mode type in include/asm/fb.h
  pick 2d85ebf8e12e x86: Use new cache mode type in drivers/video/fbdev/gbefb.c
  pick 5006e45a6bc2 x86: Use new cache mode type in drivers/video/fbdev/vermilion
  pick 1c64216be164 x86: Use new cache mode type in arch/x86/pci
  pick 2df58b6d3530 x86: Use new cache mode type in arch/x86/mm/init_64.c
  pick d85f33342a0f x86: Use new cache mode type in asm/pgtable.h
  pick 49a3b3cbdf16 x86: Use new cache mode type in mm/iomap_32.c
  pick 2a3746984c98 x86: Use new cache mode type in track_pfn_remap() and track_pfn_insert()
  pick 102e19e1955d x86: Remove looking for setting of _PAGE_PAT_LARGE in pageattr.c
  pick c06814d8419a x86: Use new cache mode type in setting page attributes
  pick b14097bd911c x86: Use new cache mode type in mm/ioremap.c
  pick e00c8cc93c1a x86: Use new cache mode type in memtype related functions
  pick 87ad0b713b10 x86: Clean up pgtable_types.h
  pick f439c429c320 x86: Support PAT bit in pagetable dump for lower levels
  pick f5b2831d6541 x86: Respect PAT bit when copying pte values between large and normal pages
  pick bd809af16e3a x86: Enable PAT to use cache mode translation tables
  pick 47591df50512 xen: Support Xen pv-domains using PAT
  pick 0dbcae884779 x86: mm: Move PAT only functions to mm/pat.c

You should see:

  Successfully rebased and updated refs/heads/test-merge-commit.

Now if you do git log you will see the above commits as a linear history
of atomic commits.

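To get a feel for the unfolding on something small before touching the
kernel tree, here is a sketch on a throwaway repository. Every file and
branch name in it other than test-merge-commit is made up for
illustration. It uses plain git rebase rather than git rebase -i; with
nothing edited, both replay the same default pick list, and the plain
form needs no editor.

```shell
#!/bin/sh
set -e
# Throwaway repository in a temporary directory.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Unfold Demo"
git config user.email "demo@example.invalid"

# Hypothetical helper: commit FILE MESSAGE
commit() {
    echo "$2" > "$1"
    git add "$1"
    git commit -q -m "$2"
}

commit base.txt "base"
base=$(git rev-parse HEAD)

# Topic branch with two commits, standing in for x86-mm-for-linus.
git checkout -q -b topic
commit a.txt "topic: change A"
commit b.txt "topic: change B"

# Mainline moves on, then merges the topic with a real merge commit.
git checkout -q -b mainline "$base"
commit main.txt "mainline work"
prior=$(git rev-parse HEAD)      # plays the role of 773fed910d4
git merge -q --no-ff -m "Merge branch 'topic'" topic
merge=$(git rev-parse HEAD)      # plays the role of a023748d53c

# Unfold: branch at the merge, then rebase onto the commit before it.
git checkout -q -b test-merge-commit "$merge"
git rebase -q "$prior"

# The merge is gone and the topic commits sit linearly on top of $prior.
unfolded=$(git rev-list --count "$prior"..HEAD)
merges=$(git rev-list --count --merges "$prior"..HEAD)
echo "unfolded commits: $unfolded, merge commits left: $merges"
```

On the real tree the equivalent check would be
git rev-list --count 773fed910d41e443e495a6bfa9ab1c2b7b13e012..HEAD,
which should report 22 if the rebase applied every pick listed above.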
You can now bisect this merge commit atomically, so do:

  git bisect start 099487de0934e3d5e326666914a426af89a0868b 773fed910d41e443e495a6bfa9ab1c2b7b13e012

Here 099487de0934 is the tip of test-merge-commit on my tree after the
rebase. A rebase creates new commit IDs, so on your tree use whatever ID
git log shows at the top of your branch (or just the branch name) as the
bad commit.

Note that this assumes that the commit prior to the merge commit is
fine. Is this true, can you confirm? (git checkout -b
test-prior-merge-gtest 773fed910d4, build and see if it doesn't break
there.) If we know for sure 773fed910d4 did not break anything then the
above bisect should let us zero in on the exact atomic commit ID that
caused the issue.

> > Full disclaimer: I may have fucked up the bisection. Finding bad commits
> > was semi easy - finding good commits needs a run time for 2-3 days.

Reducing the time it takes to reproduce a bug is an art, but perhaps we
can bring that time down.

> >
> > I would really appreciate some help and directions to nail this down.
> > The amount of time and patience on your side is appreciated as well.
> >
> > Regards
> >
> > Vassilis Virvilis
> >
> >
> >
> > bill@localhost:~/Downloads/linux$ git bisect log
> > git bisect start
> > # good: [19583ca584d6f574384e17fe7613dfaeadcdc4a6] Linux 3.16
> > git bisect good 19583ca584d6f574384e17fe7613dfaeadcdc4a6
> > # bad: [bfa76d49576599a4b9f9b7a71f23d73d6dcff735] Linux 3.19
> > git bisect bad bfa76d49576599a4b9f9b7a71f23d73d6dcff735
> > # good: [754c780953397dd5ee5191b7b3ca67e09088ce7a] Merge branch
> > 'for-v3.18' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping
> > git bisect good 754c780953397dd5ee5191b7b3ca67e09088ce7a
> > # bad: [7ef58b32f571bffb7763c6252ad7527562081f34] Merge tag
> > 'devicetree-for-linus' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/glikely/linux
> > git bisect bad 7ef58b32f571bffb7763c6252ad7527562081f34
> > # good: [53429290a054b30e4683297409fc4627b2592315] Merge
> > git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc
> > git bisect good 53429290a054b30e4683297409fc4627b2592315
> > # good: [3a647c1d7ab08145cee4b650f5e797d168846c51] Merge tag
> > 'drivers-for-linus' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
> > git bisect good 3a647c1d7ab08145cee4b650f5e797d168846c51
> > # bad: [1366f5d3129f2abde606214de7afc3dd61781fa3] Merge branch
> > 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
> > git bisect bad 1366f5d3129f2abde606214de7afc3dd61781fa3
> > # good: [151cd97630f87451cab412e40750d0e5f7581c98] Merge tag
> > 'defconfig-for-linus' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
> > git bisect good 151cd97630f87451cab412e40750d0e5f7581c98
> > # good: [ecb50f0afd35a51ef487e8a54b976052eb03d729] Merge branch
> > 'irq-core-for-linus' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> > git bisect good ecb50f0afd35a51ef487e8a54b976052eb03d729
> > # bad: [3a5dc1fafb016560315fe45bb4ef8bde259dd1bc] Merge branch
> > 'x86-microcode-for-linus' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> > git bisect bad 3a5dc1fafb016560315fe45bb4ef8bde259dd1bc
> > # good: [b6444bd0a18eb47343e16749ce80a6ebd521f124] Merge branch
> > 'x86-boot-for-linus' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> > git bisect good b6444bd0a18eb47343e16749ce80a6ebd521f124
> > # bad: [a023748d53c10850650fe86b1c4a7d421d576451] Merge branch
> > 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> > git bisect bad a023748d53c10850650fe86b1c4a7d421d576451
> > # good: [773fed910d41e443e495a6bfa9ab1c2b7b13e012] Merge branches
> > 'x86-platform-for-linus' and 'x86-uv-for-linus' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> > git bisect good 773fed910d41e443e495a6bfa9ab1c2b7b13e012
> > # good: [49a3b3cbdf1621678a39bd95a3e67c0f858539c7] x86: Use new cache
> > mode type in mm/iomap_32.c
> > git bisect good 49a3b3cbdf1621678a39bd95a3e67c0f858539c7
> > # skip: [87ad0b713b1034b6caf559976c35ce47f6d1d1e9] x86: Clean up
> > pgtable_types.h
> > git bisect skip 87ad0b713b1034b6caf559976c35ce47f6d1d1e9
> > # skip: [c06814d8419a74528500f85faf5fc01f67f8e7e6] x86: Use new cache
> > mode type in setting page attributes
> > git bisect skip c06814d8419a74528500f85faf5fc01f67f8e7e6
> > # skip: [e00c8cc93c1ac01ecd5049929a50fb47b62bb041] x86: Use new cache
> > mode type in memtype related functions
> > git bisect skip e00c8cc93c1ac01ecd5049929a50fb47b62bb041
> > # skip: [bd809af16e3ab1f8d55b3e2928c47c67e2a865d2] x86: Enable PAT to
> > use cache mode translation tables
> > git bisect skip bd809af16e3ab1f8d55b3e2928c47c67e2a865d2
> > # skip: [f439c429c320981943f8b64b2a4049d946cb492b] x86: Support PAT bit
> > in pagetable dump for lower levels
> > git bisect skip f439c429c320981943f8b64b2a4049d946cb492b
> > # skip: [47591df505129c9774af6cca2debf283a6e56ed7] xen: Support Xen
> > pv-domains using PAT
> > git bisect skip 47591df505129c9774af6cca2debf283a6e56ed7
> > # skip: [b14097bd911c2554b0b5271b3a6b2d84044d1843] x86: Use new cache
> > mode type in mm/ioremap.c
> > git bisect skip b14097bd911c2554b0b5271b3a6b2d84044d1843
> > # skip: [102e19e1955d85f31475416b1ee22980c6462cf8] x86: Remove looking
> > for setting of _PAGE_PAT_LARGE in pageattr.c
> > git bisect skip 102e19e1955d85f31475416b1ee22980c6462cf8
> > # skip: [f5b2831d654167d77da8afbef4d2584897b12d0c] x86: Respect PAT bit
> > when copying pte values between large and normal pages
> > git bisect skip f5b2831d654167d77da8afbef4d2584897b12d0c
> > # skip: [0dbcae884779fdf7e2239a97ac7488877f0693d9] x86: mm: Move PAT
> > only functions to mm/pat.c
> > git bisect skip 0dbcae884779fdf7e2239a97ac7488877f0693d9
> > # skip: [2a3746984c98b17b565e6a2c2bbaaaef757db1b4] x86: Use new cache
> > mode type in track_pfn_remap() and track_pfn_insert()
> > git bisect skip 2a3746984c98b17b565e6a2c2bbaaaef757db1b4
> > # only skipped commits left to test
> > # possible first bad commit: [a023748d53c10850650fe86b1c4a7d421d576451]
> > Merge branch 'x86-mm-for-linus' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> > # possible first bad commit: [0dbcae884779fdf7e2239a97ac7488877f0693d9]
> > x86: mm: Move PAT only functions to mm/pat.c
> > # possible first bad commit: [47591df505129c9774af6cca2debf283a6e56ed7]
> > xen: Support Xen pv-domains using PAT
> > # possible first bad commit: [bd809af16e3ab1f8d55b3e2928c47c67e2a865d2]
> > x86: Enable PAT to use cache mode translation tables
> > # possible first bad commit: [f5b2831d654167d77da8afbef4d2584897b12d0c]
> > x86: Respect PAT bit when copying pte values between large and normal pages
> > # possible first bad commit: [f439c429c320981943f8b64b2a4049d946cb492b]
> > x86: Support PAT bit in pagetable dump for lower levels
> > # possible first bad commit: [87ad0b713b1034b6caf559976c35ce47f6d1d1e9]
> > x86: Clean up pgtable_types.h
> > # possible first bad commit: [e00c8cc93c1ac01ecd5049929a50fb47b62bb041]
> > x86: Use new cache mode type in memtype related functions
> > # possible first bad commit: [b14097bd911c2554b0b5271b3a6b2d84044d1843]
> > x86: Use new cache mode type in mm/ioremap.c
> > # possible first bad commit: [c06814d8419a74528500f85faf5fc01f67f8e7e6]
> > x86: Use new cache mode type in setting page attributes
> > # possible first bad commit: [102e19e1955d85f31475416b1ee22980c6462cf8]
> > x86: Remove looking for setting of _PAGE_PAT_LARGE in pageattr.c
> > # possible first bad commit: [2a3746984c98b17b565e6a2c2bbaaaef757db1b4]
> > x86: Use new cache mode type in track_pfn_remap() and track_pfn_insert()
>

--
Luis Rodriguez, SUSE LINUX GmbH
Maxfeldstrasse 5; D-90409 Nuernberg