From: Paul Menzel <pmenzel@molgen.mpg.de>
To: Arthur Marsh <arthur.marsh@internode.on.net>
Cc: regressions@lists.linux.dev, lijo.lazar@amd.com,
	amd-gfx@lists.freedesktop.org, alexander.deucher@amd.com,
	evan.quan@amd.com, kenneth.feng@amd.com,
	christian.koenig@amd.com
Subject: Regression: No signal when loading amdgpu, and system lockup (was: [PATCH V4 17/17] drm/amd/pm: unified lock protections in amdgpu_dpm.c)
Date: Mon, 4 Apr 2022 14:06:50 +0200
Message-ID: <9e689fea-6c69-f4b0-8dee-32c4cf7d8f9c@molgen.mpg.de>
In-Reply-To: <20220331022805.17236-1-amarsh04@internode.on.net>

#regzbot introduced: 3712e7a494596b26861f4dc9b81676d1d0272eaf
#regzbot title: No signal when loading amdgpu, and system lockup
#regzbot monitor: https://gitlab.freedesktop.org/drm/amd/-/issues/1957

Dear Arthur,


Thank you for your message. Too bad you didn’t update the subject,
include regressions@lists.linux.dev, or notify regzbot [1] about it.
(That’s understandable, as you might be unfamiliar with the processes,
but I would have expected at least Evan to do so.) So I also spent quite
some time on bisecting, but reached the same conclusion.
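
For readers unfamiliar with the process: the regzbot commands are the plain `#regzbot` lines at the top of this mail body. The following is only a minimal sketch of how such lines could be pulled out of a message. The helper name `extract_regzbot_commands` and the regular expression are assumptions made for illustration; they are not part of regzbot itself, and only the three commands used in this message are exercised.

```python
import re

# Line format as it appears at the top of this mail, e.g.
# "#regzbot title: some text". The pattern is an assumption for illustration.
REGZBOT_LINE = re.compile(r"^#regzbot\s+([\w-]+):\s*(.*)$")


def extract_regzbot_commands(body):
    """Return a dict mapping each regzbot command name to its argument."""
    commands = {}
    for line in body.splitlines():
        match = REGZBOT_LINE.match(line.strip())
        if match:
            commands[match.group(1)] = match.group(2).strip()
    return commands


# The command lines are copied from this message.
MAIL_BODY = """\
#regzbot introduced: 3712e7a494596b26861f4dc9b81676d1d0272eaf
#regzbot title: No signal when loading amdgpu, and system lockup
#regzbot monitor: https://gitlab.freedesktop.org/drm/amd/-/issues/1957

Dear Arthur,
"""

commands = extract_regzbot_commands(MAIL_BODY)
for name, value in commands.items():
    print(f"{name}: {value}")
```

Lines that do not start with `#regzbot` are ignored, so ordinary body text does not end up in the result.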

Am 31.03.22 um 04:28 schrieb Arthur Marsh:
> Hi, I have a Cape Verde GPU card in my pc and after git bisecting a situation
> where, at the time the amdgpu module was loaded, the monitor would lose signal
> and the pc locked up so that it only responded to a magic sysreq boot (with no
> logging due to it happening before the root filesystem was writeable), the
> above commit was identified as the culprit.
> 
> The GPU card is a Gigabyte R7 250 with pci-id 1002:682b (rev 87).
> 
> With the 5.17.0 kernel and a kernel command line of:
> 
> amdgpu.audio=1 amdgpu.si_support=1
> 
> the following dmesg output was received:
> 
> [   76.118991] [drm] amdgpu kernel modesetting enabled.
> [   76.119100] amdgpu 0000:01:00.0: vgaarb: deactivate vga console
> [   76.120004] Console: switching to colour dummy device 80x25
> [   76.120203] [drm] initializing kernel modesetting (VERDE 0x1002:0x682B 0x1458:0x22CA 0x87).
> [   76.120211] amdgpu 0000:01:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
> [   76.120235] [drm] register mmio base: 0xFE8C0000
> [   76.120238] [drm] register mmio size: 262144
> [   76.120245] [drm] add ip block number 0 <si_common>
> [   76.120248] [drm] add ip block number 1 <gmc_v6_0>
> [   76.120251] [drm] add ip block number 2 <si_ih>
> [   76.120253] [drm] add ip block number 3 <gfx_v6_0>
> [   76.120256] [drm] add ip block number 4 <si_dma>
> [   76.120258] [drm] add ip block number 5 <si_dpm>
> [   76.120261] [drm] add ip block number 6 <dce_v6_0>
> [   76.120264] [drm] add ip block number 7 <uvd_v3_1>
> [   76.163659] [drm] BIOS signature incorrect 5b 7
> [   76.163669] resource sanity check: requesting [mem 0x000c0000-0x000dffff], which spans more than PCI Bus 0000:00 [mem 0x000d0000-0x000dffff window]
> [   76.163677] caller pci_map_rom+0x68/0x1b0 mapping multiple BARs
> [   76.163691] amdgpu 0000:01:00.0: No more image in the PCI ROM
> [   76.164996] amdgpu 0000:01:00.0: amdgpu: Fetched VBIOS from ROM BAR
> [   76.165001] amdgpu: ATOM BIOS: xxx-xxx-xxx
> [   76.165018] amdgpu 0000:01:00.0: amdgpu: PCIE atomic ops is not supported
> [   76.165270] [drm] vm size is 64 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
> [   76.349679] amdgpu 0000:01:00.0: amdgpu: VRAM: 2048M 0x000000F400000000 - 0x000000F47FFFFFFF (2048M used)
> [   76.349716] amdgpu 0000:01:00.0: amdgpu: GART: 1024M 0x000000FF00000000 - 0x000000FF3FFFFFFF
> [   76.349753] [drm] Detected VRAM RAM=2048M, BAR=256M
> [   76.349764] [drm] RAM width 128bits DDR3
> [   76.349940] [drm] amdgpu: 2048M of VRAM memory ready
> [   76.349953] [drm] amdgpu: 3072M of GTT memory ready.
> [   76.349992] [drm] GART: num cpu pages 262144, num gpu pages 262144
> [   76.350506] amdgpu 0000:01:00.0: amdgpu: PCIE GART of 1024M enabled (table at 0x000000F400900000).
> [   76.495343] [drm] Internal thermal controller with fan control
> [   76.495391] [drm] amdgpu: dpm initialized
> [   76.495637] [drm] AMDGPU Display Connectors
> [   76.495647] [drm] Connector 0:
> [   76.495655] [drm]   HDMI-A-1
> [   76.495662] [drm]   HPD1
> [   76.495668] [drm]   DDC: 0x194c 0x194c 0x194d 0x194d 0x194e 0x194e 0x194f 0x194f
> [   76.495685] [drm]   Encoders:
> [   76.495691] [drm]     DFP1: INTERNAL_UNIPHY
> [   76.495699] [drm] Connector 1:
> [   76.495706] [drm]   DVI-D-1
> [   76.495712] [drm]   HPD2
> [   76.495718] [drm]   DDC: 0x1950 0x1950 0x1951 0x1951 0x1952 0x1952 0x1953 0x1953
> [   76.495733] [drm]   Encoders:
> [   76.495739] [drm]     DFP2: INTERNAL_UNIPHY
> [   76.495746] [drm] Connector 2:
> [   76.495753] [drm]   VGA-1
> [   76.495758] [drm]   DDC: 0x1970 0x1970 0x1971 0x1971 0x1972 0x1972 0x1973 0x1973
> [   76.495773] [drm]   Encoders:
> [   76.495779] [drm]     CRT1: INTERNAL_KLDSCP_DAC1
> [   76.599604] [drm] Found UVD firmware Version: 64.0 Family ID: 13
> [   76.603443] [drm] PCIE gen 2 link speeds already enabled
> [   77.149564] [drm] UVD initialized successfully.
> [   77.149578] amdgpu 0000:01:00.0: amdgpu: SE 1, SH per SE 2, CU per SH 5, active_cu_number 8
> [   77.456492] RTL8211B Gigabit Ethernet r8169-0-300:00: attached PHY driver (mii_bus:phy_addr=r8169-0-300:00, irq=MAC)
> [   77.486245] [drm] Initialized amdgpu 3.44.0 20150101 for 0000:01:00.0 on minor 0
> [   77.521555] r8169 0000:03:00.0 eth0: Link is Down
> [   77.547158] fbcon: amdgpudrmfb (fb0) is primary device
> [   77.591226] Console: switching to colour frame buffer device 240x67
> [   77.600296] amdgpu 0000:01:00.0: [drm] fb0: amdgpudrmfb frame buffer device
> 
> I can supply extra details but found no logging from the sessions that experienced the lock-up.
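
As a sanity check on the quoted log, the reported sizes can be recomputed from the address ranges and page counts it prints. This is only a small arithmetic sketch: the constants are copied from the dmesg lines above, while the helper `range_size_mib` and the 4 KiB page size are assumptions made for illustration.

```python
MIB = 1024 * 1024
PAGE_SIZE = 4096  # assumption: 4 KiB pages for the GART page count


def range_size_mib(start, end):
    """Size in MiB of an inclusive address range [start, end]."""
    return (end - start + 1) // MIB


# "VRAM: 2048M 0x000000F400000000 - 0x000000F47FFFFFFF"
vram_mib = range_size_mib(0x000000F400000000, 0x000000F47FFFFFFF)

# "GART: 1024M 0x000000FF00000000 - 0x000000FF3FFFFFFF"
gart_mib = range_size_mib(0x000000FF00000000, 0x000000FF3FFFFFFF)

# "GART: num cpu pages 262144, num gpu pages 262144"
gart_pages_mib = 262144 * PAGE_SIZE // MIB

# "register mmio size: 262144" (bytes)
mmio_kib = 262144 // 1024

print(f"VRAM range:  {vram_mib} MiB")
print(f"GART range:  {gart_mib} MiB")
print(f"GART pages:  {gart_pages_mib} MiB")
print(f"MMIO window: {mmio_kib} KiB")
```

Under these assumptions the numbers agree with the log (2048M of VRAM, 1024M of GART), so the memory setup looks consistent and gives no hint about the lockup.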

I had created the issue *Worker [210] processing SEQNUM=2120 is taking a
long time* on March 28th, 2022 [2]. The most annoying part was that the
system locked up, SSH didn’t work, and Dell’s firmware (like all other
x86 vendor firmware not based on coreboot) is so slow. Also, the problem
was only reproducible with a powered-on monitor attached.


Kind regards,

Paul


[1]: https://linux-regtracking.leemhuis.info/regzbot/
[2]: https://gitlab.freedesktop.org/drm/amd/-/issues/1957
