linux-kernel.vger.kernel.org archive mirror
* On disabling AGP without working alternative (PCI fallback is broken for years)
@ 2020-11-09 11:40 Thomas “illwieckz“ Debesse
  2020-11-09 13:57 ` Christian König
  2020-11-09 17:37 ` Deucher, Alexander
  0 siblings, 2 replies; 8+ messages in thread
From: Thomas “illwieckz“ Debesse @ 2020-11-09 11:40 UTC (permalink / raw)
  To: LKML; +Cc: Christian König, Alex Deucher



Hi, on May 12 2020, a commit (ba806f9) was merged that disables AGP
in the default build.

It was signed off by Christian König and reviewed by Alex Deucher.
Distributions started to backport this commit, and it seems to have
happened with 5.4.0-48-generic on the Ubuntu 20.04 LTS side, which was
built on Sep 10 2020.

Around that time I noticed AGP computers experiencing lock-ups and
other problems making them unusable after the upgrade. After
bisecting Linux versions to investigate what was happening,
I reverted the commit and those computers were working again.

Commit message was:

> This means a performance regression for some GPUs,
> but also a bug fix for some others.

Unfortunately, this commit not only introduces a performance
regression but also makes some computers unusable, maybe all computers
with AMD CPUs.

One of the root causes may be that PCI GPUs have been broken for years
on AMD platforms. This was tested and verified on:

- K8-based computer with AGP
- K8-based computer with PCI Express
- K10-based computer with AGP
- Piledriver-based computer with PCI Express

The breakage was tested and reproduced from Linux 4.4 to Linux
5.10-rc2 (I have not tried older than 4.4).

PCI GPUs may be broken on some other platforms, but I have found
that testing on an Intel PC (with PCI Express) does not reproduce
the issue when the PCI GPU hardware is plugged in.

There are two patches I'm requesting comments for:

## drm/radeon: make all PCI GPUs use 32 bits DMA bit mask

https://lkml.org/lkml/2020/11/5/307

This one is not enough to fix PCI GPUs but it is enough to prevent
r600_ring_test from failing on ATI PCI devices. Note that Nvidia PCI
GPUs can't be fixed by this, and this uncovers another bug with AGP GPUs
when AGP is disabled at build time. Also, this patch may make PCI GPUs
work in a non-optimal way on platforms that accept them with a 40-bit
DMA bit mask (like Intel-based computers that already work without any
patch).
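
To make the 32-bit versus 40-bit distinction concrete, here is a
minimal userspace sketch of the mask arithmetic. The macro uses the same
expression as the kernel's DMA_BIT_MASK(); the names SKETCH_DMA_BIT_MASK
and addr_fits_dma_mask are made up for this illustration and are not
part of the radeon driver or of the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same arithmetic as the kernel's DMA_BIT_MASK(n); the name is local to this sketch. */
#define SKETCH_DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/* Hypothetical helper: true if a bus address is reachable under an n-bit DMA mask. */
static inline bool addr_fits_dma_mask(uint64_t addr, unsigned int bits)
{
    /* Any address bit set above the mask means the device cannot reach it. */
    return (addr & ~SKETCH_DMA_BIT_MASK(bits)) == 0;
}
```

With a 32-bit mask every DMA address must sit below 4 GiB, while a
40-bit mask allows addresses up to 1 TiB. That is why forcing 32 bits
may be non-optimal on hosts that already work with 40 bits.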

This patch is inspired by the patch made to solve this 2012 issue on
kernel 3.5: https://bugzilla.redhat.com/show_bug.cgi?id=785375

At the time, such a change may have been enough to fix the issue, but
that is no longer true. More breakage may have been introduced since.

Also, maybe this patch becomes unnecessary once other PCI bugs are
fixed, who knows? At least, this is an entry point for investigations.

## Revert "drm/radeon: disable AGP by default"

https://lkml.org/lkml/2020/11/5/308

This is the simple fix, and currently the only solution for AMD hosts
with an AGP port to get a display again: without this revert, those
computers have no alternative way to run a display (not even PCI
GPUs).

I'm asking for comments on those patches. I may have reached my own
skill cap on kernel development anyway. I can repurpose hardware to
test any other patch and can contribute time for such testing. Unlike
AGP GPUs, PCI GPUs are hard to find, so you may appreciate the time and
availability offered.

The PCI GPU on AMD CPU issue was verified with both Nvidia
(GS 8400GS rev.2) and ATI (Radeon HD 4350) PCI GPUs. These samples
are not old cards from the previous millennium but capable
ones: TeraScale RV710 architecture on the ATI side and Tesla 1.0 NV98
on the Nvidia side. Both can do OpenGL 3.3 and both feature
512M of VRAM. The ATI one has an HDMI port, and some variants
of the Nvidia one (not the one I own, but same specification) are known
to have an HDMI port too.

Also, fixing PCI GPUs may not be enough to fix AGP GPUs running
as PCI ones, since fixing some issues (not all) on PCI side raises
new issues with AGP GPUs running as PCI ones but not on native PCI
GPUs (see below).

Bugs aside, one thing that is important to consider against the AGP
disablement is that there is AGP hardware out there that is very capable
and not that old. For example the ATI Radeon HD 4670 AGP
(RV730 XT) was still sold brand new after 2010 and is a powerful
and featureful GPU with 1GB of VRAM and an HDMI port. Performance with
it is still pretty decent on competitive games. To compare with other
open source drivers mainlined in Linux, to outperform this GPU a
user has to get an Intel UHD 600 or an Nvidia GTX 1060 from 2016.

Also, yet another thing that is important to consider against AGP
disablement is that although PCI Express was introduced in 2004,
AGP-compatible hardware was still being designed, produced and sold
until quite late, especially on the AMD side. Computers with quad core
64-bit CPUs with virtualisation, 16GB of RAM and AGP exist, and this is
widely distributed consumer hardware, not specific esoteric hardware.

So, not only were powerful AGP GPUs still sold brand new in the current
decade, but there were also very capable computers to host them. For
those AGP computers, fixing the PCI GPU path alone is not a solution,
because falling back to PCI is not an acceptable solution in itself.

All that range of hardware became unusable with that commit disabling
AGP, with no alternative.

Not only do those AGP GPUs not work with the kernel's PCI fallback, but
unplugging them and plugging in physical PCI-native GPUs instead does
not work either.

You'll find more details about the various issues in those bug reports.
I've invested multiple full-time days testing and reproducing bugs on a
wide range of hardware, and I've attached, quoted and commented on a lot
of logs:

- https://bugs.launchpad.net/bugs/1899304
> AGP disablement leaves GPUs without working alternative
> (PCI fallback is broken), makes very-capable ATI TeraScale GPUs
> unusable

- https://bugs.launchpad.net/bugs/1902981
> AGP GPUs driven as PCI ones (when AGP is disabled at kernel build
> time) are known to fail on K8 and K10 platforms

- https://bugs.launchpad.net/bugs/1902795
> PCI graphics broken on AMD K8/K10/Piledriver platform (while it
> works on Intel) verified from Linux 4.4 to 5.10-rc2

I wish to be personally CC'ed on the answers/comments posted to the list
in response to my posting.

Thank you for your attention.

-- 
Thomas “illwieckz” Debesse



* Re: On disabling AGP without working alternative (PCI fallback is broken for years)
  2020-11-09 11:40 On disabling AGP without working alternative (PCI fallback is broken for years) Thomas “illwieckz“ Debesse
@ 2020-11-09 13:57 ` Christian König
  2020-11-09 17:37 ` Deucher, Alexander
  1 sibling, 0 replies; 8+ messages in thread
From: Christian König @ 2020-11-09 13:57 UTC (permalink / raw)
  To: Thomas “illwieckz“ Debesse, LKML; +Cc: Alex Deucher

Hi Thomas,

Am 09.11.20 um 12:40 schrieb Thomas “illwieckz“ Debesse:
> Hi, on May 12 2020, a commit (ba806f9) was merged disabling AGP
> in default build.
>
> It was signed-off by Christian König and Reviewed by Alex Deucher.
> Distributions started to backport this commit, and it seems to have
> happened with 5.4.0-48-generic on Ubuntu 20.04 LTS side, which was
> built on Sep 10 2020.
>
> Around that time I noticed AGP computers experiencing lock-ups and
> other problems making them unusable after the upgrade. After
> investigating what was happening bisecting Linux versions,
> I reverted the commit and those computers were working again.
>
> Commit message was:
>
>> This means a performance regression for some GPUs,
>> but also a bug fix for some others.
> Unfortunately, this commit does not only introduce a performance
> regression but makes some computers unusable, maybe all computers
> with AMD CPUs.
>
> One of the root cause may be that PCI GPUs are broken for years on
> AMD platforms, it was tested and verified on:
>
> - K8-based computer with AGP
> - K8-based computer with PCI Express
> - K10-based computer with AGP
> - Piledriver-based computer with PCI Express

That is interesting but doesn't make much sense from the technical 
perspective.

See, AGP is built on top of PCI; if PCI doesn't work, AGP won't work 
either. So why should AGP work while PCI doesn't?

If I'm not completely mistaken I should have a system from that time 
somewhere here.

> The breakage was tested and reproduced from Linux 4.4 to Linux
> 5.10-rc2 (I have not tried older than 4.4).
>
> PCI GPUs may be broken on some other platforms, but I have found
> that testing on an Intel PC (with PCI Express) does not reproduce
> the issue when the PCI GPU hardware is plugged in.
>
> There is two patches I'm requesting comments for:
>
> ## drm/radeon: make all PCI GPUs use 32 bits DMA bit mask
>
> https://lkml.org/lkml/2020/11/5/307
>
> This one is not enough to fix PCI GPUs but it is enough to prevent
> to fail r600_ring_test on ATI PCI devices. Note that Nvidia PCI GPUs
> can't be fixed by this, and this uncovers other bug with AGP GPUs when
> AGP is disabled at build time. Also, this patch may makes PCI GPUs
> working on a non-optimal way on platform that accepts them with 40-bit
> DMA bit mask (like Intel-based computers that already work without any
> patch).
>
> This patch is inspired from the patch made to solve that issue from
> 2012 on kernel 3.5: https://bugzilla.redhat.com/show_bug.cgi?id=785375
>
> At the time, such change may have been enough to fix the issue, it's
> not true any more. More breakage may have been introduced since.
>
> Also, maybe this patch becomes useless when other PCI bugs are fixed,
> who knows? At least, this is an entry-point for investigations.
>
> ## Revert "drm/radeon: disable AGP by default"
>
> https://lkml.org/lkml/2020/11/5/308
>
> This is the simple fix but currently only solution to make AMD hosts
> with AGP port to get a display again, as without this reverts, those
> computers do not have any alternative to run a display (even not PCI
> GPUs).

Well you can still use the agpmode parameter to override the default 
setting.

We simply don't have the time to support those older GPUs, and disabling 
AGP fixed quite a number of them.

> I'm asking for comments on those patches. I may have reached my own
> skill cap on kernel development anyway. I can repurpose hardware to
> test any other patch and can contribute time for such testing. Unlike
> AGP GPUs, PCI GPUs are hard to find, so you may appreciate the time and
> availability offered.
>
> The PCI GPU on AMD CPU issue was verified with both Nvidia
> (GS 8400GS rev.2) and ATI (Radeon HD 4350) PCI GPUs, such GPU
> sample not being old cards from the previous millennial but capable
> ones: TeraScale RV710 architecture on ATI side and Tesla 1.0 NV98
> on Nvidia side. They can both do OpenGL 3.3 and feature both
> 512M of VRAM. The ATI one had HDMI port, and it is known some variant
> of the Nvidia one (not the one I own but same specification) had HDMI
> port too.

To be honest I think we will completely drop AGP support in the next 5 
years or so, this includes both Nouveau as well as Radeon based GPUs.

We simply can't invest time maintaining a technology which has been 
deprecated for nearly 15 years now.

Regards,
Christian.

>
> Also, fixing PCI GPUs may not be enough to fix AGP GPUs running
> as PCI ones, since fixing some issues (not all) on PCI side raises
> new issues with AGP GPUs running as PCI ones but not on native PCI
> GPUs (see below).
>
> Bugs aside, one thing that is important to consider against the AGP
> disablement is that there is such hardware that is very capable and
> not that old out there. For example the ATI Radeon HD 4670 AGP
> (RV730 XT) was still sold brand new after 2010 and is a powerful
> and featureful GPUs with 1GB of VRAM and HDMI port. Performance with
> it is still pretty decent on competitive games. To compare with other
>   open source drivers mainlined in Linux, to outperform this GPU an
>   user has to get an Intel UHD 600 or an Nvidia GTX 1060 from 2016.
>
> Also, yet another thing that is important to consider against AGP
> disablement is that if PCI Express was introduced in 2004, there
> was still AGP compatible hardware being designed, produced and sold
> very lately, especially on AMD side. Computers with quad core 64-bit
> CPUs with virtualisation, 16GB of RAM and AGPs exist, and this is
> widely distributed consumer hardware, not specific esoteric hardware.
>
> So, not only powerful AGP GPUs were still sold brand new in the current
> decade, but there was also very capable computers to host them. Because
> of those AGP computers, fixing PCI GPUs fallback is not a solution
> because PCI fallback is not a solution.
>
> All that range of hardware became unusable with that commit disabling
> AGP, without alternative.
>
> Not only those AGP GPUs don't work with kernel's PCI fallback, but
> unplugging those AGP GPUs and plugging physical PCI-native GPUs
> instead does not work.
>
> You'll find more details about the various issues on those bugs, I've
> invested multiple full time day to test and reproduce bugs on a wide
> range of hardware, I've attached, quoted and commented a lot of logs:
>
> - https://bugs.launchpad.net/bugs/1899304
>> AGP disablement leaves GPUs without working alternative
>> (PCI fallback is broken), makes very-capable ATI TeraScale GPUs
>> unusable
> - https://bugs.launchpad.net/bugs/1902981
>> AGP GPUs driven as PCI ones (when AGP is disabled at kernel build
>> time) are known to fail on K8 and K10 platforms
> - https://bugs.launchpad.net/bugs/1902795
>> PCI graphics broken on AMD K8/K10/Piledriver platform (while it
>> works on Intel) verified from Linux 4.4 to 5.10-rc2
> I wish to be personally CC'ed the answers/comments posted to the list
> in response to my posting.
>
> Thank you for your attention.
>
> -- 
> Thomas “illwieckz” Debesse



* RE: On disabling AGP without working alternative (PCI fallback is broken for years)
  2020-11-09 11:40 On disabling AGP without working alternative (PCI fallback is broken for years) Thomas “illwieckz“ Debesse
  2020-11-09 13:57 ` Christian König
@ 2020-11-09 17:37 ` Deucher, Alexander
  2021-05-06  5:37   ` On disabling AGP without working alternative (PCI and PCIe are also affected) Thomas “illwieckz“ Debesse
  1 sibling, 1 reply; 8+ messages in thread
From: Deucher, Alexander @ 2020-11-09 17:37 UTC (permalink / raw)
  To: Thomas “illwieckz“ Debesse, LKML; +Cc: Koenig, Christian

[AMD Public Use]

> -----Original Message-----
> From: Thomas “illwieckz“ Debesse <dev@illwieckz.net>
> Sent: Monday, November 9, 2020 6:41 AM
> To: LKML <linux-kernel@vger.kernel.org>
> Cc: Koenig, Christian <Christian.Koenig@amd.com>; Deucher, Alexander 
> <Alexander.Deucher@amd.com>
> Subject: On disabling AGP without working alternative (PCI fallback is 
> broken for years)
> 
> Hi, on May 12 2020, a commit (ba806f9) was merged disabling AGP in 
> default build.
> 
> It was signed-off by Christian König and Reviewed by Alex Deucher.
> Distributions started to backport this commit, and it seems to have 
> happened with 5.4.0-48-generic on Ubuntu 20.04 LTS side, which was 
> built on Sep 10 2020.
> 
> Around that time I noticed AGP computers experiencing lock-ups and 
> other problems making them unusable after the upgrade. After 
> investigating what was happening bisecting Linux versions, I reverted 
> the commit and those computers were working again.
> 
> Commit message was:
> 
> > This means a performance regression for some GPUs, but also a bug 
> > fix for some others.
> 
> Unfortunately, this commit does not only introduce a performance 
> regression but makes some computers unusable, maybe all computers with 
> AMD CPUs.
> 
> One of the root cause may be that PCI GPUs are broken for years on AMD 
> platforms, it was tested and verified on:
> 
> - K8-based computer with AGP
> - K8-based computer with PCI Express
> - K10-based computer with AGP
> - Piledriver-based computer with PCI Express
> 
> The breakage was tested and reproduced from Linux 4.4 to Linux
> 5.10-rc2 (I have not tried older than 4.4).
> 
> PCI GPUs may be broken on some other platforms, but I have found that 
> testing on an Intel PC (with PCI Express) does not reproduce the issue 
> when the PCI GPU hardware is plugged in.
> 
> There is two patches I'm requesting comments for:
> 
> ## drm/radeon: make all PCI GPUs use 32 bits DMA bit mask
> 
> https://lkml.org/lkml/2020/11/5/307
> 
> This one is not enough to fix PCI GPUs but it is enough to prevent to 
> fail r600_ring_test on ATI PCI devices. Note that Nvidia PCI GPUs 
> can't be fixed by this, and this uncovers other bug with AGP GPUs when 
> AGP is disabled at build time. Also, this patch may makes PCI GPUs 
> working on a non-optimal way on platform that accepts them with 40-bit 
> DMA bit mask (like Intel- based computers that already work without any patch).
> 
> This patch is inspired from the patch made to solve that issue from
> 2012 on kernel 3.5: https://bugzilla.redhat.com/show_bug.cgi?id=785375
> 
> At the time, such change may have been enough to fix the issue, it's 
> not true any more. More breakage may have been introduced since.
> 
> Also, maybe this patch becomes useless when other PCI bugs are fixed, 
> who knows? At least, this is an entry-point for investigations.

I think you may be seeing fallout from this patch:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=33b3ad3788aba846fc8b9a065fe2685a0b64f713
That patch led to screen corruption and other issues on older radeons.  It seemed to be related to AGP and/or HIMEM.  Disabling either of those fixes the issues.
I proposed reverting the change, but there was push back to find the root cause:
https://www.spinics.net/lists/stable/msg413960.html


> 
> ## Revert "drm/radeon: disable AGP by default"
> 
> https://lkml.org/lkml/2020/11/5/308
> 
> This is the simple fix but currently only solution to make AMD hosts 
> with AGP port to get a display again, as without this reverts, those 
> computers do not have any alternative to run a display (even not PCI GPUs).
> 
> I'm asking for comments on those patches. I may have reached my own 
> skill cap on kernel development anyway. I can repurpose hardware to 
> test any other patch and can contribute time for such testing. Unlike 
> AGP GPUs, PCI GPUs are hard to find, so you may appreciate the time 
> and availability offered.
> 
> The PCI GPU on AMD CPU issue was verified with both Nvidia (GS 8400GS
> rev.2) and ATI (Radeon HD 4350) PCI GPUs, such GPU sample not being 
> old cards from the previous millennial but capable
> ones: TeraScale RV710 architecture on ATI side and Tesla 1.0 NV98 on 
> Nvidia side. They can both do OpenGL 3.3 and feature both 512M of 
> VRAM. The ATI one had HDMI port, and it is known some variant of the 
> Nvidia one (not the one I own but same specification) had HDMI port too.
> 
> Also, fixing PCI GPUs may not be enough to fix AGP GPUs running as PCI 
> ones, since fixing some issues (not all) on PCI side raises new issues 
> with AGP GPUs running as PCI ones but not on native PCI GPUs (see below).
> 
> Bugs aside, one thing that is important to consider against the AGP 
> disablement is that there is such hardware that is very capable and 
> not that old out there. For example the ATI Radeon HD 4670 AGP
> (RV730 XT) was still sold brand new after 2010 and is a powerful and 
> featureful GPUs with 1GB of VRAM and HDMI port. Performance with it is 
> still pretty decent on competitive games. To compare with other
>  open source drivers mainlined in Linux, to outperform this GPU an
>  user has to get an Intel UHD 600 or an Nvidia GTX 1060 from 2016.
> 
> Also, yet another thing that is important to consider against AGP 
> disablement is that if PCI Express was introduced in 2004, there was 
> still AGP compatible hardware being designed, produced and sold very 
> lately, especially on AMD side. Computers with quad core 64-bit CPUs 
> with virtualisation, 16GB of RAM and AGPs exist, and this is widely 
> distributed consumer hardware, not specific esoteric hardware.
> 
> So, not only powerful AGP GPUs were still sold brand new in the 
> current decade, but there was also very capable computers to host 
> them. Because of those AGP computers, fixing PCI GPUs fallback is not 
> a solution because PCI fallback is not a solution.
> 

For newer AGP hardware like the RV730 you point out (or anything newer than R300), there is no reason to run AGP mode.  The on chip GART is far superior.  The only chips where performance may be a problem are the older R1xx/R2xx radeons, and the issue there is more around the size of the TLB on the on chip GART vs the TLB in the AGP bridge. Also as Christian mentioned, AGP is PCI so if PCI doesn't work, you have bigger problems.

Alex


> All that range of hardware became unusable with that commit disabling 
> AGP, without alternative.
> 
> Not only those AGP GPUs don't work with kernel's PCI fallback, but 
> unplugging those AGP GPUs and plugging physical PCI-native GPUs 
> instead does not work.
> 
> You'll find more details about the various issues on those bugs, I've 
> invested multiple full time day to test and reproduce bugs on a wide 
> range of hardware, I've attached, quoted and commented a lot of logs:
> 
> - https://bugs.launchpad.net/bugs/1899304
> > AGP disablement leaves GPUs without working alternative (PCI 
> > fallback is broken), makes very-capable ATI TeraScale GPUs unusable
> 
> - https://bugs.launchpad.net/bugs/1902981
> > AGP GPUs driven as PCI ones (when AGP is disabled at kernel build
> > time) are known to fail on K8 and K10 platforms
> 
> - https://bugs.launchpad.net/bugs/1902795
> > PCI graphics broken on AMD K8/K10/Piledriver platform (while it 
> > works on Intel) verified from Linux 4.4 to 5.10-rc2
> 
> I wish to be personally CC'ed the answers/comments posted to the list 
> in response to my posting.
> 
> Thank you for your attention.
> 
> --
> Thomas “illwieckz” Debesse


* Re: On disabling AGP without working alternative (PCI and PCIe are also affected)
  2020-11-09 17:37 ` Deucher, Alexander
@ 2021-05-06  5:37   ` Thomas “illwieckz“ Debesse
  2021-05-13 18:42     ` Thomas “illwieckz“ Debesse
  0 siblings, 1 reply; 8+ messages in thread
From: Thomas “illwieckz“ Debesse @ 2021-05-06  5:37 UTC (permalink / raw)
  To: Deucher, Alexander, Koenig, Christian, LKML

Hi! First, thank you both Alex and Christian for your answers. Since
then I have done more experiments with more hardware. And… the new
finding is that the bug affects PCIe cards in a PCIe slot when the host
also offers an AGP slot, meaning those computers have to rely on AGP
cards (or PCI ones) instead of PCIe ones to get a working and usable
desktop. This bug is older than the AGP disablement that happened with
5.9-rc1 (out of curiosity I also reproduced the bug on Linux 4.15 from
Ubuntu Xenial, for example).

We now know the issue affects AGP, PCI, and PCIe alike. We know that the
issue affects both ATI/AMD and Nvidia hardware. The bugs occur with
particular host / card combinations, while everything works with other
combinations of the same hosts and cards (all of them having been
validated as working at some point). We know that, for PCI-related
issues, it even affects post-AGP AMD platforms like Bulldozer/Piledriver.

The PCIe issue affects both ATI/AMD and Nvidia GPUs and the symptoms are
consistent with the ones experienced with legacy PCI ATI/AMD and Nvidia
GPUs on platforms where we know PCI GPUs do not work. When I talk about
GPUs with a legacy PCI port I mean TeraScale/Tesla ones with GL3 support,
512MB of VRAM and HDMI support, not ones from the Rage era, just to make
sure this is clear enough.

Among the PCIe GPUs I managed to reproduce the bug with, I can name:

- ATI Radeon HD 5870 Eyefinity 6 (RV870 Cypress XT, TeraScale 2, OpenGL
4.3, 2GB of VRAM, 6 miniDP, released 2010-03).
- AMD FirePro 3D V4800 (RV830 Redwood XT GL, TeraScale 2, OpenGL 3.3,
1GB VRAM, 1 DVI-I + 2 DP, released 2010-04).
- AMD Radeon HD 6970 (RV970 Cayman XT, TeraScale 3, OpenGL 4.3, 2GB
VRAM, 2 DVI-D + 1 HDMI + 2 miniDP, released 2010-12).

Those GPUs don't have PCI or AGP counterparts anyway. And we know some
Linux PCI bugs can even affect TeraScale 3 PCIe GPUs with OpenGL 4 and
OpenCL support.

We also know some Nvidia GPUs (PCI and PCIe) can be affected.

It looks like the problems lie less on the GPU side and more on the PCI
platform side (bridge? chipset? I don't know). AMD is more affected, and
AMD-specific code may be involved since AMD symptoms differ from Nvidia
symptoms; even the combinations needed to reproduce the bugs may differ
by brand. Still, my attention is now focused on PCI.

## What we already knew at the time of previous emails ##

1. PCI ATI/AMD Radeon and PCI Nvidia GPUs don't work on AMD platforms.
This was verified on K8 (AMD Athlon 64 3200+, AMD Athlon 64 X2), K10
(AMD Phenom II X4 970 4 core) and Piledriver (FX-9590 8 core), so we can
probably assume all AMD64 platforms before Ryzen are affected.
Ryzen-based motherboards usually don't have legacy PCI anyway (I've
heard of the Biostar Racing X470GTA motherboard, but it's rarer than
legacy PCI and AGP).

So there are known issues with legacy PCI on pre-Ryzen architectures,
including Bulldozer ones, which are far more recent than AGP.

At the same time, those PCI ATI/AMD Radeon and PCI Nvidia GPUs work on
Intel platforms. This was verified on multiple hosts including ones with
Pentium E5200 dual core (with Intel 82801) and Core 2 Quad Q6600 (with
VIA PT880/VT82xx), to name a few.

The brokenness of PCI ATI/AMD Radeon GPUs on AMD platforms is unrelated
to the platform offering AGP or PCIe. Both K8 with AGP and K8 with PCIe
behave the same when a PCI ATI/AMD Radeon GPU is plugged in. The PCI
Radeon sample used for the test is a Radeon HD 4350, which is fairly
capable: TeraScale, GL 3.3, 512M VRAM, HDMI.

2. AGP Radeon cards stopped working on AMD platforms when AGP was
disabled by default in 5.9-rc1. The only way to make them work is to use
the radeon.agpmode=1 kernel command line option. Because distributions
like Ubuntu LTS distributed the patch backported to the 5.8 kernel
(either because they backported it themselves or because kernel
developers did it upstream, I don't know), after the update the
computers were not able to complete the boot because they never reached
the desktop. This affected pretty capable computers like the one I
previously described, which runs the quad core AMD AM3 Phenom II X4 970
CPU (3.5GHz) with 16GB of RAM and features an AGP Radeon HD 4600
(TeraScale, GL 3.3, 1GB VRAM, HDMI).

If I use startx with lxsession I can get a working X.org environment,
but it is super slow and not really usable. Very slow disk I/O is
reported and audio glitches are experienced, so the side effects reach
well beyond the display. Starting a more complex environment like GNOME
just makes the computer unusable.
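
As a rough model of what the agpmode option changes, here is a small
sketch of the decision as described in this thread. It is not the radeon
driver's code: the names bus_path, AGPMODE_DISABLED and
effective_bus_path are hypothetical, and treating every value other than
-1 as "AGP enabled" is a simplifying assumption.

```c
#include <stdbool.h>

/* Hypothetical names: a model of the behaviour discussed here, not driver code. */
enum bus_path {
    BUS_PATH_AGP,      /* card driven through the AGP bridge */
    BUS_PATH_PCI_GART  /* card driven as a plain PCI device with its on-chip GART */
};

/* radeon.agpmode=-1 means "AGP as PCI"; AGP being disabled by default behaves the same. */
#define AGPMODE_DISABLED (-1)

static inline enum bus_path effective_bus_path(bool card_is_agp, int agpmode)
{
    if (!card_is_agp)
        return BUS_PATH_PCI_GART;  /* PCI and PCIe cards never take the AGP path */
    if (agpmode == AGPMODE_DISABLED)
        return BUS_PATH_PCI_GART;  /* AGP card forced onto the PCI fallback */
    return BUS_PATH_AGP;           /* assumption: any other value keeps AGP enabled */
}
```

In this model, booting with radeon.agpmode=1 moves an AGP card back to
the AGP path, while the option changes nothing for PCI or PCIe cards.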

3. AGP Radeon cards running as PCI cards on AMD platforms display the
same broken behaviour we can see with PCI Radeon cards, which is not
surprising given it is expected they would run the same way. At least
this prediction got verified. ATI/AMD AGP as PCI on AMD platform is as
broken as ATI/AMD PCI on AMD platform.

4. Kernel developers said they had noticed some problems with the latest
AMD Radeon hardware and that disabling AGP improved the support for
those recent cards. That is why AGP was disabled starting with 5.9-rc1,
in the hope of fixing the latest AMD Radeon hardware (but that broke
older ones in the process).

## What is new knowledge since that time ##

1. PCIe ATI/AMD Radeon GPUs running on an Intel host having both AGP and
PCIe slots do not work and display the same broken behaviour we can see
with AGP Radeon cards running as PCI cards, or PCI Radeon cards on AMD
platforms.

This was verified with a wide range of AMD/ATI PCIe GPUs, both consumer
Radeon and professional FirePro cards.

Interestingly, the testbed is an Intel-based platform (Core 2 Quad
Q6600), where PCI Radeon cards work, as we have seen with other
Intel-based hosts. If I'm right, AGP cards seem to work as PCI ones as
well. That may make sense because previously I did not test AGP Radeon
cards on Intel platforms.

But then, as I said, the PCIe Radeon cards fail just like AGP ones
running as PCI, and like PCI ones on AMD platforms. But for them,
radeon.agpmode=1 does not make sense, so there is no solution.

By failing I mean grub displays things correctly, then linux displays
things correctly, including the framebuffer, but when X.org starts and
opens the desktop, some background is painted but the desktop never
completes the startup. The mouse pointer can be moved with the mouse but
that's all. The tested desktop is GNOME, and the shell itself does not
display.

So, with all those experiments done, and with all that knowledge, it
appears there are some serious issues in the PCI code.

Note: on some very old cards (like the 9700 Pro from 2002) with
radeon.agpmode -1 (AGP as PCI), the symptom is different: the desktop
loads properly, but loading a game (Unvanquished) leads to a GPU lockup.
On newer hardware like the ATI Radeon HD 4670 AGP (which was still
sold as new in 2012), the desktop won't load, displaying the exact same
symptoms as ATI/AMD PCI on AMD CPU. If I'm right, the old ones like
the 9700 Pro are likely to be native AGP cards, while the latest ones
like the HD 4670 may be natively PCIe with a bridge on the AGP card.
Maybe that can ring a bell…

2. PCIe Nvidia GPUs on an Intel host having both AGP and PCIe slots do
not work and display the same broken behaviour we can see with a PCI
Nvidia card running on an AMD host. The graphical glitches are exactly
the same with Nvidia PCI on the AMD host and Nvidia PCIe on the Intel
host (but not the same as the PCI, AGP-as-PCI and PCIe Radeon symptoms).

So, we can reproduce the Nvidia-specific glitches with both PCI and
PCIe, and we can reproduce the ATI/AMD-specific symptoms with PCI,
AGP, and PCIe.

## Various answers and questions ##

Christian König said:

> That is interesting but doesn't make much sense from the technical
> perspective.
> See, AGP is built on top of PCI; if PCI doesn't work, AGP won't work
> either. So why should AGP work while PCI doesn't?

Now we know that PCI, AGP and PCIe are all affected. Which makes sense.

What makes the bugs hard to track is that they may or may not occur
depending on the host and card combination. This is probably about the
GPU/PCI bridge combination (and motherboard chipset when it makes sense)
or things like that.

Alex Deucher said:

> For newer AGP hardware like the RV730 you point out (or anything newer
> than R300), there is no reason to run AGP mode. The on chip GART is far
> superior.

So, on paper, AGP-as-PCI is expected to work, and on paper again, some
of those cards may even work better this way. Experience currently
shows the exact opposite, which not only means there are bugs
somewhere, but also that the behaviour of the PCI code is unpredictable,
because predictions fail.

Christian König said:

> We simply don't have the time to support those older GPUs, and
> disabling AGP fixed quite a number of them.

Was disabling AGP motivated by issues whose causes were identified but
that it was decided not to fix, or by the observation that it fixed some
other issues without their causes being identified?

What's now interesting is that on some PCIe-compatible platforms, PCIe
is broken and AGP is the working fallback, and now that AGP is disabled
by default in code, neither works out of the box.

I can understand how it would be easier to not support older hardware,
but on the other hand what's the purpose of Mesa/RadeonSI supporting
them on the userland side if the kernel can't host the hardware to begin
with?

Also, that may be seen as unfortunate, but AGP is not only about Rage
128 cards or those very, very old things that would not fulfill current
needs. Unlike Nvidia, AMD/ATI had AGP hardware that was produced and
sold until very recently and that is still capable of fulfilling current
needs. At the same time, AMD ensured very good compatibility of its
hardware, which is why it was possible to have the quoted quad core AM3
Phenom II on an AM2 motherboard with AGP, for example. This is precisely
why AMD is appreciated by customers, unlike Intel with its frustrating
market segmentation where, for example (real use case), one Pentium
E5200 with IGP can support OpenGL 3 but not virtualization, while
another Pentium E5200 with IGP cannot support OpenGL 3 but supports
virtualization, or (another real example) supporting PAE while hiding
it from the operating system. Buying AMD is all about not having to
choose between this or that feature, and about being able to get
hardware that works over multiple hardware generations.

But anyway, outside of those considerations, it now appears the PCI code
has serious issues and the behaviour can't be predicted. Newer hardware
may be working, but do we know how much luck is involved?

I may be busy (doing those extra tests and reporting the results took me
some extra months), but at least I have access to a wide range of
hardware to test any patch that would aim to fix the
PCI/AGP/PCIe-related bugs. I would be happy to help on that topic. AGP
is just one aspect of it: we now know those PCI-related bugs affect
legacy PCI and PCI Express as well.

Christian König said:

> We simply can't invest time maintaining a technology which is
> deprecated for nearly 15 years now.

It now appears that the bugs not only affect AGP and PCI but also PCI
Express. AGP revealed those bugs, but PCI seems to be at fault there.

One interesting thing is that some ATI/AMD cards on AMD hosts are more
buggy than the same ATI/AMD cards on Intel hosts. The underlying bugs
may not even be related to the cards themselves but to the host PCI code
(chipsets, PCI bridges or things like that).

Note: one interesting thing is that I have access to two Radeon HD 4670,
one AGP model and one PCIe model, from the same vendor and the exact
same generation and model, the only difference being the AGP or PCIe
variant. On the same Intel-based motherboard supporting both AGP and
PCIe, only the AGP model works. The PCIe model is not faulty: it works
as expected on AMD-based motherboards having only a PCIe port and no
AGP port. Getting things working seems to be about luck, not about what
the implementation is said to do.

I also have access to two X1950 Pro, one PCIe, one AGP, from different
vendors, though this pair is less interesting because it is not a
TeraScale one. But this may be useful for testing because I can test
both on the same motherboard having both an AGP and a PCIe slot.
Currently, only AGP works on that host anyway, because when using
AMD/ATI or Nvidia PCIe GPUs on that Intel host I reproduce the issues I
get with AMD/ATI or Nvidia PCI GPUs on an AMD host…

Who are the ones working on the PCI platform code? Maybe they would be
better interlocutors: it looks like the issue is not AMD-specific, it
affects Nvidia GPUs and Intel platforms as well.

Are there options similar to radeon.agpmode, but for PCI / PCI Express,
that I can experiment with?
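
To see which radeon options are in effect on a test machine, here is a
minimal sketch. It assumes the radeon module is loaded and that sysfs is
mounted at /sys; the helper name read_module_params is only
illustrative. It merely lists the readable parameters the driver exposes
(agpmode among them) and does not by itself tell whether a PCI / PCIe
equivalent exists.

```python
from pathlib import Path


def read_module_params(module, sysfs_root="/sys/module"):
    """Return {parameter name: current value} for a loaded kernel module.

    Assumption: readable module parameters are exposed as one file each
    under <sysfs_root>/<module>/parameters/.
    """
    params_dir = Path(sysfs_root) / module / "parameters"
    values = {}
    if not params_dir.is_dir():
        # Module not loaded, or it exposes no readable parameters.
        return values
    for entry in sorted(params_dir.iterdir()):
        try:
            values[entry.name] = entry.read_text().strip()
        except OSError:
            # Some parameters may not be readable by the current user.
            values[entry.name] = "<unreadable>"
    return values


if __name__ == "__main__":
    # Print the parameters in kernel command line syntax.
    for name, value in read_module_params("radeon").items():
        print(f"radeon.{name}={value}")
```

Attaching that output to a bug report next to the dmesg log records the
exact driver configuration each test was run with.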

One day I'll build the latest vanilla kernel to reproduce the issues and
probably open a ticket on the kernel bugzilla regarding PCI-related
problems in general (even if there are AMD-specific variants of the
bug). That would be a good start.

Thank you very much for your attention, best regards,

PS: I wish to be personally CC'ed the answers/comments posted to the
list in response to my posting.

--
Thomas “illwieckz” Debesse
On 09/11/2020 at 18:37, Deucher, Alexander wrote:
> [AMD Public Use]
> 
>> -----Original Message-----
>> From: Thomas “illwieckz“ Debesse <dev@illwieckz.net>
>> Sent: Monday, November 9, 2020 6:41 AM
>> To: LKML <linux-kernel@vger.kernel.org>
>> Cc: Koenig, Christian <Christian.Koenig@amd.com>; Deucher, Alexander 
>> <Alexander.Deucher@amd.com>
>> Subject: On disabling AGP without working alternative (PCI fallback is 
>> broken for years)
>>
>> Hi, on May 12 2020, a commit (ba806f9) was merged disabling AGP in 
>> default build.
>>
>> It was signed-off by Christian König and Reviewed by Alex Deucher.
>> Distributions started to backport this commit, and it seems to have 
>> happened with 5.4.0-48-generic on Ubuntu 20.04 LTS side, which was 
>> built on Sep 10 2020.
>>
>> Around that time I noticed AGP computers experiencing lock-ups and 
>> other problems making them unusable after the upgrade. After 
>> investigating what was happening bisecting Linux versions, I reverted 
>> the commit and those computers were working again.
>>
>> Commit message was:
>>
>>> This means a performance regression for some GPUs, but also a bug 
>>> fix for some others.
>>
>> Unfortunately, this commit does not only introduce a performance 
>> regression but makes some computers unusable, maybe all computers with 
>> AMD CPUs.
>>
>> One of the root cause may be that PCI GPUs are broken for years on AMD 
>> platforms, it was tested and verified on:
>>
>> - K8-based computer with AGP
>> - K8-based computer with PCI Express
>> - K10-based computer with AGP
>> - Piledriver-based computer with PCI Express
>>
>> The breakage was tested and reproduced from Linux 4.4 to Linux
>> 5.10-rc2 (I have not tried older than 4.4).
>>
>> PCI GPUs may be broken on some other platforms, but I have found that 
>> testing on an Intel PC (with PCI Express) does not reproduce the issue 
>> when the PCI GPU hardware is plugged in.
>>
>> There is two patches I'm requesting comments for:
>>
>> ## drm/radeon: make all PCI GPUs use 32 bits DMA bit mask
>>
>> https://lkml.org/lkml/2020/11/5/307
>>
>> This one is not enough to fix PCI GPUs but it is enough to prevent to 
>> fail r600_ring_test on ATI PCI devices. Note that Nvidia PCI GPUs 
>> can't be fixed by this, and this uncovers other bug with AGP GPUs when 
>> AGP is disabled at build time. Also, this patch may makes PCI GPUs 
>> working on a non-optimal way on platform that accepts them with 40-bit 
>> DMA bit mask (like Intel- based computers that already work without any patch).
>>
>> This patch is inspired from the patch made to solve that issue from
>> 2012 on kernel 3.5: https://bugzilla.redhat.com/show_bug.cgi?id=785375
>>
>> At the time, such change may have been enough to fix the issue, it's 
>> not true any more. More breakage may have been introduced since.
>>
>> Also, maybe this patch becomes useless when other PCI bugs are fixed, 
>> who knows? At least, this is an entry-point for investigations.
> 
> I think you may be seeing fallout from this patch:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=33b3ad3788aba846fc8b9a065fe2685a0b64f713
> That patch lead to screen corruption and other issues on older radeons.  It seemed to be related to AGP and/or HIMEM.  Disabling either of those fixes the issues.
> I proposed reverting the change, but there was push back to find the root cause:
> https://www.spinics.net/lists/stable/msg413960.html
> 
> 
>>
>> ## Revert "drm/radeon: disable AGP by default"
>>
>> https://lkml.org/lkml/2020/11/5/308
>>
>> This is the simple fix but currently only solution to make AMD hosts 
>> with AGP port to get a display again, as without this reverts, those 
>> computers do not have any alternative to run a display (even not PCI GPUs).
>>
>> I'm asking for comments on those patches. I may have reached my own 
>> skill cap on kernel development anyway. I can repurpose hardware to 
>> test any other patch and can contribute time for such testing. Unlike 
>> AGP GPUs, PCI GPUs are hard to find, so you may appreciate the time 
>> and availability offered.
>>
>> The PCI GPU on AMD CPU issue was verified with both Nvidia (GS 8400GS
>> rev.2) and ATI (Radeon HD 4350) PCI GPUs, such GPU sample not being 
>> old cards from the previous millennial but capable
>> ones: TeraScale RV710 architecture on ATI side and Tesla 1.0 NV98 on 
>> Nvidia side. They can both do OpenGL 3.3 and feature both 512M of 
>> VRAM. The ATI one had HDMI port, and it is known some variant of the 
>> Nvidia one (not the one I own but same specification) had HDMI port too.
>>
>> Also, fixing PCI GPUs may not be enough to fix AGP GPUs running as PCI 
>> ones, since fixing some issues (not all) on PCI side raises new issues 
>> with AGP GPUs running as PCI ones but not on native PCI GPUs (see below).
>>
>> Bugs aside, one thing that is important to consider against the AGP 
>> disablement is that there is such hardware that is very capable and 
>> not that old out there. For example the ATI Radeon HD 4670 AGP
>> (RV730 XT) was still sold brand new after 2010 and is a powerful and 
>> featureful GPUs with 1GB of VRAM and HDMI port. Performance with it is 
>> still pretty decent on competitive games. To compare with other
>>  open source drivers mainlined in Linux, to outperform this GPU an
>>  user has to get an Intel UHD 600 or an Nvidia GTX 1060 from 2016.
>>
>> Also, yet another thing that is important to consider against AGP 
>> disablement is that if PCI Express was introduced in 2004, there was 
>> still AGP compatible hardware being designed, produced and sold very 
>> lately, especially on AMD side. Computers with quad core 64-bit CPUs 
>> with virtualisation, 16GB of RAM and AGPs exist, and this is widely 
>> distributed consumer hardware, not specific esoteric hardware.
>>
>> So, not only powerful AGP GPUs were still sold brand new in the 
>> current decade, but there was also very capable computers to host 
>> them. Because of those AGP computers, fixing PCI GPUs fallback is not 
>> a solution because PCI fallback is not a solution.
>>
> 
> For newer AGP hardware like the RV730 you point out (or anything newer than R300), there is no reason to run AGP mode.  The on chip GART is far superior.  The only chips where performance may be a problem is the older R1xx/R2xx radeons, and the issue there is more around the size of the TLB on the on chip GART vs the TLB in the AGP bridge. Also as Christian mentioned, AGP is PCI so if PCI doesn't work, you have bigger problems.
> 
> Alex
> 
> 
>> All that range of hardware became unusable with that commit disabling 
>> AGP, without alternative.
>>
>> Not only those AGP GPUs don't work with kernel's PCI fallback, but 
>> unplugging those AGP GPUs and plugging physical PCI-native GPUs 
>> instead does not work.
>>
>> You'll find more details about the various issues on those bugs, I've 
>> invested multiple full time day to test and reproduce bugs on a wide 
>> range of hardware, I've attached, quoted and commented a lot of logs:
>>
>> - https://bugs.launchpad.net/bugs/1899304
>>> AGP disablement leaves GPUs without working alternative (PCI 
>>> fallback is broken), makes very-capable ATI TeraScale GPUs unusable
>>
>> - https://bugs.launchpad.net/bugs/1902981
>>> AGP GPUs driven as PCI ones (when AGP is disabled at kernel build
>>> time) are known to fail on K8 and K10 platforms
>>
>> - https://bugs.launchpad.net/bugs/1902795
>>> PCI graphics broken on AMD K8/K10/Piledriver platform (while it 
>>> works on Intel) verified from Linux 4.4 to 5.10-rc2
>>
>> I wish to be personally CC'ed the answers/comments posted to the list 
>> in response to my posting.
>>
>> Thank you for your attention.
>>
>> --
>> Thomas “illwieckz” Debesse

* Re: On disabling AGP without working alternative (PCI and PCIe are also affected)
  2021-05-06  5:37   ` On disabling AGP without working alternative (PCI and PCIe are also affected) Thomas “illwieckz“ Debesse
@ 2021-05-13 18:42     ` Thomas “illwieckz“ Debesse
  2021-05-13 19:02       ` Deucher, Alexander
  0 siblings, 1 reply; 8+ messages in thread
From: Thomas “illwieckz“ Debesse @ 2021-05-13 18:42 UTC (permalink / raw)
  To: Deucher, Alexander, Koenig, Christian, LKML

Erratum: In my previous e-mail I forgot to remove that sentence about
the Intel platform:

> If I'm right, AGP cards seems to work as PCI ones as well. That
> may makes sense because previously I did not test AGP Radeon cards on
> Intel platforms.

I first wrote that after having been fooled by a lucky hardware
combination. It traded an immediate failure at desktop load for a later
computer reboot at game load (but with a successfully loaded desktop,
hence the illusion that it worked at first).

The valid statement, after more testing, is (quoted from the previous
mail, which therefore contained contradictory statements):

> Note: on some very old cards (like the 9700 Pro from 2002) with
> radeon.agpmode -1 (AGP as PCI), the symptom is different, the desktop
> loads properly, but loading a game (Unvanquished) leads to a GPU
> lockup.
> While on newer hardware like ATI Radeon HD 4670 AGP (which was still
> sold as new in 2012), the desktop won't load, displaying the exact
> same symptoms as ATI/AMD PCI on ATI/AMD CPU. If I'm right, the old
> ones like the 9700 Pro is likely to be a native AGP card, while the
> latest ones like the HD 4670 may be natively PCIe with a bridge on the
> AGP card.
> Maybe that can ring a bell…

Not only can the GPU lock up, but the computer may also crash and reboot
in that scenario. The same hardware and software combination works
perfectly when AGP is enabled.
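
When reporting such tests it helps to record which AGP mode the kernel
was booted with. Here is a minimal sketch, assuming the option is passed
on the kernel command line as radeon.agpmode=<value> (for example
radeon.agpmode=-1 for AGP as PCI) rather than through a modprobe
configuration file. The helper name module_option is illustrative, and
treating the last occurrence as the effective one is an assumption.

```python
def module_option(cmdline, module, option):
    """Return the value of module.option on a kernel command line, or None.

    Assumption: if the option appears several times, the last one wins.
    """
    key = f"{module}.{option}="
    value = None
    for token in cmdline.split():
        if token.startswith(key):
            value = token[len(key):]
    return value


if __name__ == "__main__":
    # /proc/cmdline holds the command line the running kernel booted with.
    with open("/proc/cmdline") as f:
        mode = module_option(f.read(), "radeon", "agpmode")
    if mode is None:
        print("radeon.agpmode not set on the command line (driver default applies)")
    elif mode == "-1":
        print("radeon.agpmode=-1: AGP card driven as PCI")
    else:
        print(f"radeon.agpmode={mode}")
```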

Those AMD/ATI AGP cards are therefore known to be affected on both AMD
and Intel host platforms.

Best regards,

PS: I wish to be personally CC'ed the answers/comments posted to the
list in response to my posting.

--
Thomas “illwieckz” Debesse

On 06/05/2021 at 07:37, Thomas “illwieckz“ Debesse wrote:
> Hi! First, thank you both Alex and Christian for your answers. Since
> that time I did more experiments with more hardware. And… the new
> knowledge is that the bug affects PCIe cards on PCIe slot when the host
> also offers AGP slot, meaning those computers have to rely on AGP cards
> (or PCI ones) instead of PCIe ones to get a working and usable desktop.
> This bug is older than the AGP disablement that happened with 5.9-rc1
> (out of curiosity I also reproduced the bug on Linux 4.15 from Ubuntu
> Xenial, for example).
> 
> We now know the issue affects both AGP, PCI, and PCIe. We know that the
> issue affects both ATI/AMD and Nvidia hardware. The bugs occur given
> this or that host / card combination, while everything works with other
> combinations of the same hosts and cards (all of them being validated to
> be working at some point). We know that, for PCI-related issues, it even
> affects post-AGP AMD platforms like Bulldozer/Piledriver.
> 
> The PCIe issue affects both ATI/AMD and Nvidia GPUs and the symptoms are
> consistent with the ones experienced with legacy PCI ATI/AMD and Nvidia
> GPUs on platform we know PCI GPUs do not work. When I talk about GPUs
> with legacy PCI port I talk about TeraScale/Tesla ones with GL3 support,
> 512MB of VRAM and HDMI support, not ones from the Rage era, just to make
> sure this is clear enough.
> 
> Among the PCIe GPUs I managed to reproduce the bug with, I can name:
> 
> - ATI Radeon HD 5870 Eyefinity 6 (RV870 Cypress XT, TeraScale 2, OpenGL
> 4.3, 2GB of VRAM, 6 miniDP, released 2010-03).
> - AMD FirePro 3D V4800 (RV830 Redwood XT GL, Terascale2, OpenGL 3.3, 1GB
> VRAM, 1 DVI-I + 2 DP, released 2010-04).
> - AMD Radeon HD 6970 (RV910 Caicos, TeraScale 3, OpenGL 4.3, 2GB VRAM, 2
> DVI-D + 1 HDMI + 2 miniDP, released 2010-12).
> 
> Those GPUs don't have PCI or AGP counterparts anyway. And we know some
> Linux PCI bugs can even affect TeraScale 3 PCIe GPUs with OpenGL 4 and
> OpenCL support.
> 
> We also know some Nvidia GPU (PCI and PCIe) can be affected.
> 
> It looks like the problems are less on the GPU side, but more on the PCI
> platform (bridge? chipset? I don't know). So while AMD is more affected,
> and maybe AMD specific code is affected since AMD symptoms are different
> than Nvidia symptoms, even combinations to reproduce bugs may differ
> given the brand, my attention is now focused on PCI.
> 
> ## What we already knew at the time of previous emails ##
> 
> 1. PCI ATI/AMD Radeon and PCI Nvidia GPUs don't work on AMD platform,
> this was verified on K8 (AMD Athlon 64 3200+, AMD Athlon 64 X2), K10
> (AMD Phenom II X4 970 8 core) and Piledriver (FX-9590 8 core), so we can
> probably assume all AMD64 platforms before Ryzen, Ryzen-based
> motherboard usually don't have legacy PCI anyway (I've heard of the
> Biostar Racing X470GTA motherboard but well, it's more rare than legacy
> PCI and AGP).
> 
> So there are known issues with legacy PCI on pre-Ryzen architecture,
> including bulldozer ones, which is far more recent than AGP.
> 
> At the same time, those PCI ATI/AMD Radeon and PCI Nvidia GPUs work on
> Intel platforms, this was verified on multiple hosts including ones with
> Pentium E5200 dual core (with Intel 82801) and Core 2 Quad Q6600 (with
> VIA PT880/VT82xx) to name somes.
> 
> The brokenness of PCI ATI/AMD Radeon GPU on AMD platform is unrelated to
> the platform offering AGP or PCIe. Both K8 with AGP and K8 with PCIe
> behaves the same when a PCI ATI/AMD Radeon GPU is plugged in. The PCI
> Radeon sample used for the test is a Radeon HD 4350 which is fairly
> capable: TeraScale, GL 3.3, 512M VRAM, HDMI.
> 
> 2. AGP Radeon cards stopped to work on AMD platforms when AGP was
> disabled by default in 5.9-rc1. The only way to make them work is to use
> radeon.agpmode=1 kernel command line option. Because distributions like
> Ubuntu LTS distributed the patch backported on 5.8 kernel (either
> because they backported it themselves or kernel developers did it
> upstream, I don't know), after the update the computers were not able to
> complete the boot because they never reached the desktop. This affected
> pretty capable computers like the one I previously quoted, which runs
> the quad core AMD AM3 Phenom II CPU X4 970 (3.5GHz) with 16GB of RAM and
> featuring AGP Radeon HD 4600 (TeraScale, GL 3.3, 1GB VRAM, HDMI).
> 
> If I use startx with lxsession I can get a working X.org environement,
> but that is super slow and not really usable. Very slow disk IO are
> reported and audio glitches are experienced, this has side-effects on a
> wider scale than the sole display. Starting a more complexe environment
> like GNOME will just make the computer unusable.
> 
> 3. AGP Radeon cards running as PCI cards on AMD platforms display the
> same broken behaviour we can see with PCI Radeon cards, which is not
> surprising given it is expected they would run the same way. At least
> this prediction got verified. ATI/AMD AGP as PCI on AMD platform is as
> broken as ATI/AMD PCI on AMD platform.
> 
> 4. Some problems were said to have been noticed by kernel developers
> with latest AMD Radeon hardware and it was said disabling AGP improved
> the support for those recent cards, that's why AGP was disabled starting
> with 5.9-rc1 in hope to fix the latest AMD Radeon hardware (but that
> brokes older ones in the process).
> 
> ## What is new knowledge since that time ##
> 
> 1. PCIe ATI/AMD Radeon GPUs running on Intel host having both AGP and
> PCIe slots do not work and displays the same broken behaviour we can see
> with AGP Radeon cards running as PCI cards, or PCI Radeon cards on AMD
> platform.
> 
> This was verified with a wide range of AMD/ATI PCIe GPUs, both consumer
> Radeon or professionnal FirePro cards.
> 
> Interestingly, the testbed is an Intel-based platform (Core 2 Quad
> Q6600) and then PCI Radeon cards work as we seen with other Intel based
> hosts. If I'm right, AGP cards seems to work as PCI ones as well. That
> may makes sense because previously I did not test AGP Radeon cards on
> Intel platforms.
> 
> But then, as I said, the PCIe Radeon cards just fails as AGP ones
> running as PCI and PCI ones on AMD platforms. But for them,
> radeon.agpmode=1 does not make sense, so there is no solution.
> 
> By failing I mean grub displays things correctly, then linux displays
> things correctly, including framebuffer, but when X.org starts and open
> the desktop, some background is painted but the desktop never complete
> the startup. The mouse pointer can be moved with the mouse but that's
> all. The tested desktop is GNOME, and the shell itself does not display.
> 
> So, with all those experiments done, and with all that knowledge, it
> appears there is some serious issues in the PCI code.
> 
> Note: on some very old cards (like the 9700 Pro from 2002) with
> radeon.agpmode -1 (AGP as PCI), the symptom is different, the desktop
> loads properly, but loading a game (Unvanquished) leads to a GPU lockup.
> While on newer hardware like ATI Radeon HD 4670 AGP (which was still
> sold as new in 2012), the desktop won't load, displaying the exact same
> symptoms as ATI/AMD PCI on ATI/AMD CPU. If I'm right, the old ones like
> the 9700 Pro is likely to be a native AGP card, while the latest ones
> like the HD 4670 may be natively PCIe with a bridge on the AGP card.
> Maybe that can ring a bell…
> 
> 2. PCIe Ndidia GPUs on Intel host having both AGP and PCIe slots do not
> work and displays the same broken behaviour we can see with PCI Nvidia
> card running on AMD host. The graphical glitches are exactly the sames
> with Nvidia PCI on the AMD host and Nvidia PCIe on the Intel host (but
> not the same as PCI and AGP-as-PCI and PCIe Radeon symptoms).
> 
> So, we can reproduce the Nvidia-specific glitches with both PCI and
> PCIe, and we can reproduce the ATI/AMD-specific symptoms with both PCI,
> AGP, and PCIe.
> 
> ## Various answers and questions ##
> 
> Christian Köning said:
> 
>> That is interesting but doesn't make much sense from the technical
> perspective.
>> See AGP is build on top of PCI, if PCI doesn't work AGP won't work
> either. So why should AGP work while PCI doesn't?
> 
> Now we know that both PCI, AGP and PCIe are affected. Which makes sense.
> 
> What's makes hard to track the bugs is that the bugs may occur or not
> occur given the host and cards combination. This is probably about
> GPU/PCI bridge combination (and motherboard chipset when it makes sense)
> or things like that.
> 
> Alex Deucher said:
> 
>> For newer AGP hardware like the RV730 you point out (or anything newer
> than R300), there is no reason to run AGP mode. The on chip GART is far
> superior.
> 
> So, on paper, AGP-as-PCI is expected to work, and on paper again, some
> of those card may even work better this way. Experience currently
> displays the exact opposite, which not only means there is bugs
> somewhere, but also that the behaviour of the PCI code is unpredictable,
> because predictions fail.
> 
> Christian Köning said:
> 
>> We simply don't have the time to support that older GPU and disabling
> AGP fixed quite a number of them.
> 
> Was disabling AGP motivated by some issues with identified causes and it
> was decided to not fix them, or was disabling AGP motivated by the
> observation it fixed some other issues but without identifying the causes?
> 
> What's now interesting is that on some PCIe-compatible platforms, PCIe
> is broken and AGP is the working fallback, and now that AGP is disabled
> by default in code, none work out of the box.
> 
> I can understand how it would be easier to not support older hardware,
> but on the other hand what's the purpose of Mesa/RadeonSI supporting
> them on the userland side if the kernel can't host the hardware to begin
> with?
> 
> Also, that may be seen as unfortunate, but AGP is not only about Rage
> 128 cards or those very very old thing that would not fullfill current
> needs. Unlike Nvidia, there was AMD/ATI AGP hardware that were produced
> and sold very lately and those are still capable to fullfill current
> needs. At the same time, AMD ensured very good compatibility of it's
> hardware, that's why it was possible to have the quoted quad core AM3
> Phenom II on an AM2 motherboard with AGP for example. This is precisely
> why AMD is appreciated by customers, not like Intel with frustrating
> market segmentation where, for example (real use case), one Pentium
> E5200 with IGP can support OpenGL 3 but not virtualization, while
> another Pentium E5200 with IGP cannot support OpenGL 3 but supports
> virtualization, or (another real example), supporting PAE while hiding
> it to the operating system. Buying AMD is all about not having to choose
> between this or this feature, and buying AMD is all about being able to
> get hardware that works over multiple hardware generations.
> 
> But anyway, outside of those considerations, it now appears the PCI code
> has serious issues and the behaviour can't be predicted. Newer hardware
> may be working, but do we know how much luck is involved?
> 
> I may be busy, doing those extra tests and reporting the results took me
> some extra months, but at least, I have access to a wide range of
> hardware to test any patch that would aim to fix the
> PCI/AGP/PCIe-related bugs. I would be happy to help on that topic. AGP
> is just one aspect of it, now we know those PCI-related bugs affect
> legacy PCI and PCI express as well.
> 
> Christian Köning said:
> 
>> We simply can't invest time maintaining a technology which is
> deprecated for nearly 15 years now.
> 
> It now appears that the bugs not only affect AGP and PCI but also PCI
> Express. AGP disclosed those bugs, but PCI seems to be at fault there.
> 
> One interesting thing is that some ATI/AMD cards on AMD hosts are more
> buggy than the same ATI/AMD cards on Intel hosts. The underlying bugs
> may even not be related to the cards themselves but on the host PCI code
> (chipsets, PCI bridges or things like that).
> 
> Note: one interesting thing is that I have access to two Radeon HD 4670,
> one AGP model, one PCIe model, from the same vendor, exact same
> generation, vendor and model, just one being AGP and one being PCIe
> variant. On the same Intel-based motherboard supporting both AGP and
> PCIe, only the AGP model works. The PCIe model is not faulty, it works
> as expected on AMD-based motherboards only having an AGP port and no
> PCIe port. Getting things working seems to be about luck, not about what
> the implementation is said to do.
> 
> I also have access to two X1950 pro, one PCIe, one AGP, from different
> vendors, though this one is less interesting because not a TeraScale
> one. But this may be useful for testing because I can test both on the
> same motherboard having both an AGP and a PCIe slot. Currently, only AGP
> works on that host anyway because when using AMD/ATI or Nvidia PCIe GPUs
> on that Intel host I reproduce the issues I get with AMD/ATI or Nvidia
> PCI  GPUs on AMD host…
> 
> Who are the ones working on the PCI platform code? Maybe those would be
> better interlocutors, it looks like the issue is not AMD specific, it
> affects Nvidia GPUs and Intel platforms as well.
> 
> Is there options similar to radeon.agpmode but for PCI / PCI Express I
> can experiment with?
> 
> I'll build one day the latest vanilla kernel to reproduce the issues and
> probably open a ticket on the kernel bugzilla regarding PCI-related
> problems in general (even if there is AMD-specific variants of the bug).
> That would be a good start.
> 
> Thank your very much for your attention, best regards,
> 
> PS: I wish to be personally CC'ed the answers/comments posted to the
> list in response to my posting.
> 
> --
> Thomas “illwieckz” Debesse
> On 09/11/2020 at 18:37, Deucher, Alexander wrote:
>> [AMD Public Use]
>>
>>> -----Original Message-----
>>> From: Thomas “illwieckz“ Debesse <dev@illwieckz.net>
>>> Sent: Monday, November 9, 2020 6:41 AM
>>> To: LKML <linux-kernel@vger.kernel.org>
>>> Cc: Koenig, Christian <Christian.Koenig@amd.com>; Deucher, Alexander 
>>> <Alexander.Deucher@amd.com>
>>> Subject: On disabling AGP without working alternative (PCI fallback is 
>>> broken for years)
>>>
>>> Hi, on May 12 2020, a commit (ba806f9) was merged disabling AGP in 
>>> default build.
>>>
>>> It was signed-off by Christian König and Reviewed by Alex Deucher.
>>> Distributions started to backport this commit, and it seems to have 
>>> happened with 5.4.0-48-generic on Ubuntu 20.04 LTS side, which was 
>>> built on Sep 10 2020.
>>>
>>> Around that time I noticed AGP computers experiencing lock-ups and 
>>> other problems making them unusable after the upgrade. After 
>>> investigating what was happening bisecting Linux versions, I reverted 
>>> the commit and those computers were working again.
>>>
>>> Commit message was:
>>>
>>>> This means a performance regression for some GPUs, but also a bug 
>>>> fix for some others.
>>>
>>> Unfortunately, this commit does not only introduce a performance 
>>> regression but makes some computers unusable, maybe all computers with 
>>> AMD CPUs.
>>>
>>> One of the root cause may be that PCI GPUs are broken for years on AMD 
>>> platforms, it was tested and verified on:
>>>
>>> - K8-based computer with AGP
>>> - K8-based computer with PCI Express
>>> - K10-based computer with AGP
>>> - Piledriver-based computer with PCI Express
>>>
>>> The breakage was tested and reproduced from Linux 4.4 to Linux
>>> 5.10-rc2 (I have not tried older than 4.4).
>>>
>>> PCI GPUs may be broken on some other platforms, but I have found that 
>>> testing on an Intel PC (with PCI Express) does not reproduce the issue 
>>> when the PCI GPU hardware is plugged in.
>>>
>>> There is two patches I'm requesting comments for:
>>>
>>> ## drm/radeon: make all PCI GPUs use 32 bits DMA bit mask
>>>
>>> https://lkml.org/lkml/2020/11/5/307
>>>
>>> This one is not enough to fix PCI GPUs but it is enough to prevent to 
>>> fail r600_ring_test on ATI PCI devices. Note that Nvidia PCI GPUs 
>>> can't be fixed by this, and this uncovers other bug with AGP GPUs when 
>>> AGP is disabled at build time. Also, this patch may makes PCI GPUs 
>>> working on a non-optimal way on platform that accepts them with 40-bit 
>>> DMA bit mask (like Intel- based computers that already work without any patch).
>>>
>>> This patch is inspired from the patch made to solve that issue from
>>> 2012 on kernel 3.5: https://bugzilla.redhat.com/show_bug.cgi?id=785375
>>>
>>> At the time, such change may have been enough to fix the issue, it's 
>>> not true any more. More breakage may have been introduced since.
>>>
>>> Also, maybe this patch becomes useless when other PCI bugs are fixed, 
>>> who knows? At least, this is an entry-point for investigations.
>>
>> I think you may be seeing fallout from this patch:
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=33b3ad3788aba846fc8b9a065fe2685a0b64f713
>> That patch lead to screen corruption and other issues on older radeons.  It seemed to be related to AGP and/or HIMEM.  Disabling either of those fixes the issues.
>> I proposed reverting the change, but there was push back to find the root cause:
>> https://www.spinics.net/lists/stable/msg413960.html
>>
>>
>>>
>>> ## Revert "drm/radeon: disable AGP by default"
>>>
>>> https://lkml.org/lkml/2020/11/5/308
>>>
>>> This is the simple fix but currently only solution to make AMD hosts 
>>> with AGP port to get a display again, as without this reverts, those 
>>> computers do not have any alternative to run a display (even not PCI GPUs).
>>>
>>> I'm asking for comments on those patches. I may have reached my own 
>>> skill cap on kernel development anyway. I can repurpose hardware to 
>>> test any other patch and can contribute time for such testing. Unlike 
>>> AGP GPUs, PCI GPUs are hard to find, so you may appreciate the time 
>>> and availability offered.
>>>
>>> The PCI GPU on AMD CPU issue was verified with both Nvidia (GS 8400GS
>>> rev.2) and ATI (Radeon HD 4350) PCI GPUs. These samples are not
>>> old cards from the previous millennium but capable
>>> ones: TeraScale RV710 architecture on the ATI side and Tesla 1.0 NV98 on
>>> the Nvidia side. Both can do OpenGL 3.3 and both feature 512M of
>>> VRAM. The ATI one has an HDMI port, and some variants of the Nvidia
>>> one (not the one I own, but same specification) are known to have an
>>> HDMI port too.
>>>
>>> Also, fixing PCI GPUs may not be enough to fix AGP GPUs running as PCI 
>>> ones, since fixing some issues (not all) on PCI side raises new issues 
>>> with AGP GPUs running as PCI ones but not on native PCI GPUs (see below).
>>>
>>> Bugs aside, one thing that is important to consider against the AGP
>>> disablement is that very capable and not-that-old AGP hardware is
>>> still out there. For example, the ATI Radeon HD 4670 AGP
>>> (RV730 XT) was still sold brand new after 2010 and is a powerful and
>>> featureful GPU with 1GB of VRAM and an HDMI port. Performance with it is
>>> still pretty decent on competitive games. To compare with other
>>> open source drivers mainlined in Linux, to outperform this GPU a
>>> user has to get an Intel UHD 600 or an Nvidia GTX 1060 from 2016.
>>>
>>> Also, yet another thing that is important to consider against AGP
>>> disablement is that, although PCI Express was introduced in 2004, AGP
>>> compatible hardware was still being designed, produced and sold until
>>> quite recently, especially on the AMD side. Computers with quad core
>>> 64-bit CPUs with virtualisation, 16GB of RAM and AGP exist, and this is
>>> widely distributed consumer hardware, not specific esoteric hardware.
>>>
>>> So, not only were powerful AGP GPUs still sold brand new in the
>>> current decade, but there were also very capable computers to host
>>> them. Because of those AGP computers, fixing PCI GPUs fallback is not
>>> a solution because PCI fallback is not a solution.
>>>
>>
>> For newer AGP hardware like the RV730 you point out (or anything newer than R300), there is no reason to run AGP mode.  The on chip GART is far superior.  The only chips where performance may be a problem is the older R1xx/R2xx radeons, and the issue there is more around the size of the TLB on the on chip GART vs the TLB in the AGP bridge. Also as Christian mentioned, AGP is PCI so if PCI doesn't work, you have bigger problems.
>>
>> Alex
>>
>>
>>> All that range of hardware became unusable with that commit disabling 
>>> AGP, without alternative.
>>>
>>> Not only those AGP GPUs don't work with kernel's PCI fallback, but 
>>> unplugging those AGP GPUs and plugging physical PCI-native GPUs 
>>> instead does not work.
>>>
>>> You'll find more details about the various issues in those bug reports.
>>> I've invested multiple full-time days testing and reproducing bugs on a
>>> wide range of hardware, and I've attached, quoted and commented on a lot
>>> of logs:
>>>
>>> - https://bugs.launchpad.net/bugs/1899304
>>>> AGP disablement leaves GPUs without working alternative (PCI 
>>>> fallback is broken), makes very-capable ATI TeraScale GPUs unusable
>>>
>>> - https://bugs.launchpad.net/bugs/1902981
>>>> AGP GPUs driven as PCI ones (when AGP is disabled at kernel build
>>>> time) are known to fail on K8 and K10 platforms
>>>
>>> - https://bugs.launchpad.net/bugs/1902795
>>>> PCI graphics broken on AMD K8/K10/Piledriver platform (while it 
>>>> works on Intel) verified from Linux 4.4 to 5.10-rc2
>>>
>>> I wish to be personally CC'ed the answers/comments posted to the list 
>>> in response to my posting.
>>>
>>> Thank you for your attention.
>>>
>>> --
>>> Thomas “illwieckz” Debesse

^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: On disabling AGP without working alternative (PCI and PCIe are also affected)
  2021-05-13 18:42     ` Thomas “illwieckz“ Debesse
@ 2021-05-13 19:02       ` Deucher, Alexander
  2021-05-13 20:46         ` Thomas “illwieckz“ Debesse
  0 siblings, 1 reply; 8+ messages in thread
From: Deucher, Alexander @ 2021-05-13 19:02 UTC (permalink / raw)
  To: Thomas “illwieckz“ Debesse, Koenig, Christian, LKML

[AMD Public Use]

> -----Original Message-----
> From: Thomas “illwieckz“ Debesse <dev@illwieckz.net>
> Sent: Thursday, May 13, 2021 2:42 PM
> To: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian
> <Christian.Koenig@amd.com>; LKML <linux-kernel@vger.kernel.org>
> Subject: Re: On disabling AGP without working alternative (PCI and PCIe are
> also affected)
> 
> Erratum: In my previous e-mail I forgot to remove that sentence about Intel
> platform:
> 
> > If I'm right, AGP cards seems to work as PCI ones as well. That may
> > makes sense because previously I did not test AGP Radeon cards on
> > Intel platforms.
> 
> I first wrote that after having been fooled by a lucky hardware combination.
> It traded an immediate failure at desktop load for a later computer reboot at
> game load (but with a successfully loaded desktop, hence the illusion that it
> worked at first).
> 
> The valid statement, after more testing, is (quoted from the previous mail
> which had then contradictory statements):
> 
> > Note: on some very old cards (like the 9700 Pro from 2002) with
> > radeon.agpmode -1 (AGP as PCI), the symptom is different, the desktop
> > loads properly, but loading a game (Unvanquished) leads to a GPU
> > lockup.
> > While on newer hardware like ATI Radeon HD 4670 AGP (which was still
> > sold as new in 2012), the desktop won't load, displaying the exact
> > same symptoms as ATI/AMD PCI on ATI/AMD CPU. If I'm right, the old
> > ones like the 9700 Pro is likely to be a native AGP card, while the
> > latest ones like the HD 4670 may be natively PCIe with a bridge on the
> > AGP card.
> > Maybe that can ring a bell…
> 
> Not only the GPU can lockup, but the computer may also crash and reboot in
> that scenario. The same hardware and software combination works perfectly
> when AGP is enabled.
> 
> Those AMD/ATI AGP cards are then known to be affected on both AMD and
> Intel host platforms.

I don't think I have a functional AGP system anymore, but I do have PCIe capable systems and they work fine.
Does this patch [1] help by any chance?  The change to add support for root ports with addressing limitations seemed to break a lot of old systems,
but never really got resolved.  If not, your best bet is probably to try and bisect to see if something broke your system(s).

Alex

[1] - https://www.spinics.net/lists/amd-gfx/msg52961.html
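
For context on the addressing limitations mentioned above, and on the 32-bit
versus 40-bit DMA bit mask discussed in the quoted patch description: a DMA
mask bounds the bus addresses a device can reach. What follows is only a
minimal userspace sketch, not driver code. DMA_BIT_MASK is written here to
mirror the kernel macro of the same name, while dma_range_fits() is a
hypothetical helper introduced for illustration and is not a kernel API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same shape as the kernel's DMA_BIT_MASK(n) macro: the low n bits set.
 * The n == 64 case is special-cased to avoid an undefined 64-bit shift. */
#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/* Hypothetical helper (not a kernel API): can a buffer covering
 * [addr, addr + size) be reached by a device limited to the given mask?
 * size must be non-zero. */
static bool dma_range_fits(uint64_t addr, uint64_t size, uint64_t mask)
{
    uint64_t last = addr + size - 1;

    if (last < addr)        /* wrapped around the 64-bit address space */
        return false;
    return last <= mask;    /* the last byte must still be addressable */
}
```

With a 32-bit mask, a page that starts at the 4GB boundary is out of reach,
while the same page fits under a 40-bit mask. On a host with 16GB of RAM, a
device restricted to 32 bits therefore cannot address most of memory directly,
which is one plausible reason the choice of mask matters on the machines
discussed in this thread.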

> 
> Best regards,
> 
> PS: I wish to be personally CC'ed the answers/comments posted to the list in
> response to my posting.
> 
> --
> Thomas “illwieckz” Debesse
> 
> On 06/05/2021 at 07:37, Thomas “illwieckz“ Debesse wrote:
> > Hi! First, thank you both Alex and Christian for your answers. Since
> > that time I did more experiments with more hardware. And… the new
> > knowledge is that the bug affects PCIe cards in a PCIe slot when the
> > host also offers an AGP slot, meaning those computers have to rely on AGP
> > cards (or PCI ones) instead of PCIe ones to get a working and usable
> > desktop.
> > This bug is older than the AGP disablement that happened with 5.9-rc1
> > (out of curiosity I also reproduced the bug on Linux 4.15 from Ubuntu
> > Xenial, for example).
> >
> > We now know the issue affects AGP, PCI, and PCIe alike. We know that
> > the issue affects both ATI/AMD and Nvidia hardware. The bugs occur
> > with some host / card combinations, while everything works
> > with other combinations of the same hosts and cards (all of them having
> > been validated as working at some point). We know that, for PCI-related
> > issues, it even affects post-AGP AMD platforms like Bulldozer/Piledriver.
> >
> > The PCIe issue affects both ATI/AMD and Nvidia GPUs and the symptoms
> > are consistent with the ones experienced with legacy PCI ATI/AMD and
> > Nvidia GPUs on platforms where we know PCI GPUs do not work. When I talk
> > about GPUs with a legacy PCI port, I mean TeraScale/Tesla ones with
> > GL3 support, 512MB of VRAM and HDMI support, not ones from the Rage
> > era, just to make sure this is clear enough.
> >
> > Among the PCIe GPUs I managed to reproduce the bug with, I can name:
> >
> > - ATI Radeon HD 5870 Eyefinity 6 (RV870 Cypress XT, TeraScale 2,
> > OpenGL 4.3, 2GB of VRAM, 6 miniDP, released 2010-03).
> > - AMD FirePro 3D V4800 (RV830 Redwood XT GL, TeraScale 2, OpenGL 3.3,
> > 1GB VRAM, 1 DVI-I + 2 DP, released 2010-04).
> > - AMD Radeon HD 6970 (RV970 Cayman XT, TeraScale 3, OpenGL 4.3, 2GB VRAM,
> > 2 DVI-D + 1 HDMI + 2 miniDP, released 2010-12).
> >
> > Those GPUs don't have PCI or AGP counterparts anyway. And we know some
> > Linux PCI bugs can even affect TeraScale 3 PCIe GPUs with OpenGL 4 and
> > OpenCL support.
> >
> > We also know some Nvidia GPU (PCI and PCIe) can be affected.
> >
> > It looks like the problems are less on the GPU side and more on the
> > PCI platform (bridge? chipset? I don't know). So while AMD is more
> > affected, and maybe AMD-specific code is affected since AMD symptoms
> > differ from Nvidia symptoms and the combinations needed to reproduce
> > the bugs may differ by brand, my attention is now focused on PCI.
> >
> > ## What we already knew at the time of previous emails ##
> >
> > 1. PCI ATI/AMD Radeon and PCI Nvidia GPUs don't work on AMD platforms.
> > This was verified on K8 (AMD Athlon 64 3200+, AMD Athlon 64 X2), K10
> > (AMD Phenom II X4 970 quad core) and Piledriver (FX-9590 8 core), so we
> > can probably assume all AMD64 platforms before Ryzen are affected.
> > Ryzen-based motherboards usually don't have legacy PCI anyway (I've
> > heard of the Biostar Racing X470GTA motherboard but well, it's more rare
> > than legacy PCI and AGP).
> >
> > So there are known issues with legacy PCI on pre-Ryzen architectures,
> > including Bulldozer ones, which are far more recent than AGP.
> >
> > At the same time, those PCI ATI/AMD Radeon and PCI Nvidia GPUs work on
> > Intel platforms. This was verified on multiple hosts including ones
> > with Pentium E5200 dual core (with Intel 82801) and Core 2 Quad Q6600
> > (with VIA PT880/VT82xx), to name a few.
> >
> > The brokenness of PCI ATI/AMD Radeon GPUs on AMD platforms is unrelated
> > to the platform offering AGP or PCIe. Both K8 with AGP and K8 with
> > PCIe behave the same when a PCI ATI/AMD Radeon GPU is plugged in. The
> > PCI Radeon sample used for the test is a Radeon HD 4350, which is
> > fairly capable: TeraScale, GL 3.3, 512M VRAM, HDMI.
> >
> > 2. AGP Radeon cards stopped working on AMD platforms when AGP was
> > disabled by default in 5.9-rc1. The only way to make them work is to
> > use the radeon.agpmode=1 kernel command line option. Because distributions
> > like Ubuntu LTS distributed the patch backported on 5.8 kernel (either
> > because they backported it themselves or kernel developers did it
> > upstream, I don't know), after the update the computers were not able
> > to complete the boot because they never reached the desktop. This
> > affected pretty capable computers like the one I previously quoted,
> > which runs the quad core AMD AM3 Phenom II CPU X4 970 (3.5GHz) with
> > 16GB of RAM and features an AGP Radeon HD 4600 (TeraScale, GL 3.3, 1GB
> > VRAM, HDMI).
> >
> > If I use startx with lxsession I can get a working X.org environment,
> > but it is super slow and not really usable. Very slow disk I/O is
> > reported and audio glitches are experienced, so this has side effects on
> > a wider scale than the display alone. Starting a more complex
> > environment like GNOME just makes the computer unusable.
> >
> > 3. AGP Radeon cards running as PCI cards on AMD platforms display the
> > same broken behaviour we can see with PCI Radeon cards, which is not
> > surprising given it is expected they would run the same way. At least
> > this prediction got verified. ATI/AMD AGP as PCI on AMD platform is as
> > broken as ATI/AMD PCI on AMD platform.
> >
> > 4. Some problems were said to have been noticed by kernel developers
> > with the latest AMD Radeon hardware, and it was said that disabling AGP
> > improved support for those recent cards. That's why AGP was disabled
> > starting with 5.9-rc1, in the hope of fixing the latest AMD Radeon
> > hardware (but that broke older ones in the process).
> >
> > ## What is new knowledge since that time ##
> >
> > 1. PCIe ATI/AMD Radeon GPUs running on an Intel host having both AGP and
> > PCIe slots do not work and display the same broken behaviour we can
> > see with AGP Radeon cards running as PCI cards, or PCI Radeon cards on
> > AMD platforms.
> >
> > This was verified with a wide range of AMD/ATI PCIe GPUs, both
> > consumer Radeon and professional FirePro cards.
> >
> > Interestingly, the testbed is an Intel-based platform (Core 2 Quad
> > Q6600) and PCI Radeon cards work there, as we have seen with other Intel
> > based hosts. If I'm right, AGP cards seems to work as PCI ones as
> > well. That may makes sense because previously I did not test AGP
> > Radeon cards on Intel platforms.
> >
> > But then, as I said, the PCIe Radeon cards just fail like AGP ones
> > running as PCI and PCI ones on AMD platforms. But for them,
> > radeon.agpmode=1 does not make sense, so there is no solution.
> >
> > By failing I mean grub displays things correctly, then linux displays
> > things correctly, including framebuffer, but when X.org starts and
> > opens the desktop, some background is painted but the desktop never
> > completes the startup. The mouse pointer can be moved with the mouse
> > but that's all. The tested desktop is GNOME, and the shell itself does not
> > display.
> >
> > So, with all those experiments done, and with all that knowledge, it
> > appears there are some serious issues in the PCI code.
> >
> > Note: on some very old cards (like the 9700 Pro from 2002) with
> > radeon.agpmode -1 (AGP as PCI), the symptom is different, the desktop
> > loads properly, but loading a game (Unvanquished) leads to a GPU lockup.
> > While on newer hardware like ATI Radeon HD 4670 AGP (which was still
> > sold as new in 2012), the desktop won't load, displaying the exact
> > same symptoms as ATI/AMD PCI on ATI/AMD CPU. If I'm right, the old
> > ones like the 9700 Pro is likely to be a native AGP card, while the
> > latest ones like the HD 4670 may be natively PCIe with a bridge on the AGP
> > card.
> > Maybe that can ring a bell…
> >
> > 2. PCIe Nvidia GPUs on an Intel host having both AGP and PCIe slots do
> > not work and display the same broken behaviour we can see with a PCI
> > Nvidia card running on an AMD host. The graphical glitches are exactly
> > the same with Nvidia PCI on the AMD host and Nvidia PCIe on the Intel
> > host (but not the same as PCI and AGP-as-PCI and PCIe Radeon
> > symptoms).
> >
> > So, we can reproduce the Nvidia-specific glitches with both PCI and
> > PCIe, and we can reproduce the ATI/AMD-specific symptoms with both
> > PCI, AGP, and PCIe.
> >
> > ## Various answers and questions ##
> >
> > Christian König said:
> >
> >> That is interesting but doesn't make much sense from the technical
> >> perspective.
> >> See AGP is built on top of PCI, if PCI doesn't work AGP won't work
> >> either. So why should AGP work while PCI doesn't?
> >
> > Now we know that PCI, AGP and PCIe are all affected. Which makes
> > sense.
> >
> > What makes the bugs hard to track is that they may or may not
> > occur depending on the host and card combination. This is probably about
> > the GPU/PCI bridge combination (and motherboard chipset when it makes
> > sense) or things like that.
> >
> > Alex Deucher said:
> >
> >> For newer AGP hardware like the RV730 you point out (or anything
> >> newer than R300), there is no reason to run AGP mode. The on chip GART
> >> is far superior.
> >
> > So, on paper, AGP-as-PCI is expected to work, and on paper again, some
> > of those cards may even work better this way. Experience currently
> > shows the exact opposite, which not only means there are bugs
> > somewhere, but also that the behaviour of the PCI code is
> > unpredictable, because predictions fail.
> >
> > Christian König said:
> >
> >> We simply don't have the time to support that older GPU and disabling
> >> AGP fixed quite a number of them.
> >
> > Was disabling AGP motivated by some issues with identified causes and
> > it was decided to not fix them, or was disabling AGP motivated by the
> > observation it fixed some other issues but without identifying the causes?
> >
> > What's now interesting is that on some PCIe-compatible platforms, PCIe
> > is broken and AGP is the working fallback, and now that AGP is
> > disabled by default in code, none work out of the box.
> >
> > I can understand how it would be easier to not support older hardware,
> > but on the other hand what's the purpose of Mesa/RadeonSI supporting
> > them on the userland side if the kernel can't host the hardware to
> > begin with?
> >
> > Also, that may be seen as unfortunate, but AGP is not only about Rage
> > 128 cards or those very very old things that would not fulfill current
> > needs. Unlike Nvidia, AMD/ATI produced and sold AGP hardware
> > until quite recently, and that hardware is still capable of fulfilling
> > current needs. At the same time, AMD ensured very good compatibility
> > of its hardware, that's why it was possible to have the quoted quad
> > core AM3 Phenom II on an AM2 motherboard with AGP for example. This is
> > precisely why AMD is appreciated by customers, unlike Intel with its
> > frustrating market segmentation where, for example (real use case),
> > one Pentium E5200 with IGP can support OpenGL 3 but not virtualization,
> > while another Pentium E5200 with IGP cannot support OpenGL 3 but supports
> > virtualization, or (another real example), supporting PAE while hiding
> > it from the operating system. Buying AMD is all about not having to
> > choose between this or that feature, and buying AMD is all about being
> > able to get hardware that works over multiple hardware generations.
> >
> > But anyway, outside of those considerations, it now appears the PCI
> > code has serious issues and the behaviour can't be predicted. Newer
> > hardware may be working, but do we know how much luck is involved?
> >
> > I may be busy, doing those extra tests and reporting the results took
> > me some extra months, but at least, I have access to a wide range of
> > hardware to test any patch that would aim to fix the
> > PCI/AGP/PCIe-related bugs. I would be happy to help on that topic. AGP
> > is just one aspect of it, now we know those PCI-related bugs affect
> > legacy PCI and PCI express as well.
> >
> > Christian König said:
> >
> >> We simply can't invest time maintaining a technology which is
> >> deprecated for nearly 15 years now.
> >
> > It now appears that the bugs not only affect AGP and PCI but also PCI
> > Express. AGP disclosed those bugs, but PCI seems to be at fault there.
> >
> > One interesting thing is that some ATI/AMD cards on AMD hosts are more
> > buggy than the same ATI/AMD cards on Intel hosts. The underlying bugs
> > may even not be related to the cards themselves but on the host PCI
> > code (chipsets, PCI bridges or things like that).
> >
> > Note: one interesting thing is that I have access to two Radeon HD
> > 4670 cards, one AGP and one PCIe, from the same vendor and of the exact
> > same generation and model, the only difference being the AGP or PCIe
> > variant. On the same Intel-based motherboard supporting both AGP and
> > PCIe, only the AGP model works. The PCIe model is not faulty: it works
> > as expected on AMD-based motherboards having only a PCIe slot and no
> > AGP slot. Getting things working seems to be about luck, not about
> > what the implementation is said to do.
> >
> > I also have access to two X1950 pro, one PCIe, one AGP, from different
> > vendors, though this one is less interesting because it is not a TeraScale
> > one. But this may be useful for testing because I can test both on the
> > same motherboard having both an AGP and a PCIe slot. Currently, only
> > AGP works on that host anyway because when using AMD/ATI or Nvidia
> > PCIe GPUs on that Intel host I reproduce the issues I get with AMD/ATI
> > or Nvidia PCI GPUs on AMD host…
> >
> > Who are the ones working on the PCI platform code? Maybe those would
> > be better interlocutors, it looks like the issue is not AMD specific,
> > it affects Nvidia GPUs and Intel platforms as well.
> >
> > Are there options similar to radeon.agpmode but for PCI / PCI Express I
> > can experiment with?
> >
> > I'll build one day the latest vanilla kernel to reproduce the issues
> > and probably open a ticket on the kernel bugzilla regarding
> > PCI-related problems in general (even if there are AMD-specific variants
> > of the bug).
> > That would be a good start.
> >
> > Thank you very much for your attention, best regards,
> >
> > PS: I wish to be personally CC'ed the answers/comments posted to the
> > list in response to my posting.
> >
> > --
> > Thomas “illwieckz” Debesse
> > On 09/11/2020 at 18:37, Deucher, Alexander wrote:
> >> [AMD Public Use]
> >>
> >>> -----Original Message-----
> >>> From: Thomas “illwieckz“ Debesse <dev@illwieckz.net>
> >>> Sent: Monday, November 9, 2020 6:41 AM
> >>> To: LKML <linux-kernel@vger.kernel.org>
> >>> Cc: Koenig, Christian <Christian.Koenig@amd.com>; Deucher, Alexander
> >>> <Alexander.Deucher@amd.com>
> >>> Subject: On disabling AGP without working alternative (PCI fallback
> >>> is broken for years)
> >>>
> >>> Hi, on May 12 2020, a commit (ba806f9) was merged disabling AGP in
> >>> default build.
> >>>
> >>> It was signed-off by Christian König and Reviewed by Alex Deucher.
> >>> Distributions started to backport this commit, and it seems to have
> >>> happened with 5.4.0-48-generic on Ubuntu 20.04 LTS side, which was
> >>> built on Sep 10 2020.
> >>>
> >>> Around that time I noticed AGP computers experiencing lock-ups and
> >>> other problems making them unusable after the upgrade. After
> >>> investigating what was happening bisecting Linux versions, I
> >>> reverted the commit and those computers were working again.
> >>>
> >>> Commit message was:
> >>>
> >>>> This means a performance regression for some GPUs, but also a bug
> >>>> fix for some others.
> >>>
> >>> Unfortunately, this commit does not only introduce a performance
> >>> regression but makes some computers unusable, maybe all computers
> >>> with AMD CPUs.
> >>>
> >>> One of the root cause may be that PCI GPUs are broken for years on
> >>> AMD platforms, it was tested and verified on:
> >>>
> >>> - K8-based computer with AGP
> >>> - K8-based computer with PCI Express
> >>> - K10-based computer with AGP
> >>> - Piledriver-based computer with PCI Express
> >>>
> >>> The breakage was tested and reproduced from Linux 4.4 to Linux
> >>> 5.10-rc2 (I have not tried older than 4.4).
> >>>
> >>> PCI GPUs may be broken on some other platforms, but I have found
> >>> that testing on an Intel PC (with PCI Express) does not reproduce
> >>> the issue when the PCI GPU hardware is plugged in.
> >>>
> >>> There is two patches I'm requesting comments for:
> >>>
> >>> ## drm/radeon: make all PCI GPUs use 32 bits DMA bit mask
> >>>
> >>> https://lkml.org/lkml/2020/11/5/307
> >>>
> >>> This one is not enough to fix PCI GPUs, but it is enough to keep
> >>> r600_ring_test from failing on ATI PCI devices. Note that Nvidia PCI GPUs
> >>> can't be fixed by this, and this uncovers another bug with AGP GPUs
> >>> when AGP is disabled at build time. Also, this patch may make PCI
> >>> GPUs work in a non-optimal way on platforms that accept them with a
> >>> 40-bit DMA bit mask (like Intel-based computers that already work
> >>> without any patch).
> >>>
> >>> This patch is inspired by the patch made in 2012 to solve that issue
> >>> on kernel 3.5:
> >>> https://bugzilla.redhat.com/show_bug.cgi?id=785375
> >>>
> >>> At the time, such a change may have been enough to fix the issue, but
> >>> that is no longer true. More breakage may have been introduced since.
> >>>
> >>> Also, maybe this patch becomes useless when other PCI bugs are
> >>> fixed, who knows? At least, this is an entry-point for investigations.
> >>
> >> I think you may be seeing fallout from this patch:
> >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=33b3ad3788aba846fc8b9a065fe2685a0b64f713
> >> That patch led to screen corruption and other issues on older radeons.
> >> It seemed to be related to AGP and/or HIMEM.  Disabling either of those
> >> fixes the issues.
> >> I proposed reverting the change, but there was push back to find the root
> >> cause:
> >> https://www.spinics.net/lists/stable/msg413960.html
> >>
> >>
> >>>
> >>> ## Revert "drm/radeon: disable AGP by default"
> >>>
> >>> https://lkml.org/lkml/2020/11/5/308
> >>>
> >>> This is the simple fix, but currently the only solution for AMD hosts
> >>> with an AGP port to get a display again. Without this revert, those
> >>> computers have no alternative way to run a display (not even PCI
> >>> GPUs).
> >>>
> >>> I'm asking for comments on those patches. I may have reached my own
> >>> skill cap on kernel development anyway. I can repurpose hardware to
> >>> test any other patch and can contribute time for such testing.
> >>> Unlike AGP GPUs, PCI GPUs are hard to find, so you may appreciate
> >>> the time and availability offered.
> >>>
> >>> The PCI GPU on AMD CPU issue was verified with both Nvidia (GS 8400GS
> >>> rev.2) and ATI (Radeon HD 4350) PCI GPUs. These samples are not
> >>> old cards from the previous millennium but capable
> >>> ones: TeraScale RV710 architecture on the ATI side and Tesla 1.0 NV98 on
> >>> the Nvidia side. Both can do OpenGL 3.3 and both feature 512M of
> >>> VRAM. The ATI one has an HDMI port, and some variants of the Nvidia
> >>> one (not the one I own, but same specification) are known to have an
> >>> HDMI port too.
> >>>
> >>> Also, fixing PCI GPUs may not be enough to fix AGP GPUs running as
> >>> PCI ones, since fixing some issues (not all) on PCI side raises new
> >>> issues with AGP GPUs running as PCI ones but not on native PCI GPUs
> >>> (see below).
> >>>
> >>> Bugs aside, one thing that is important to consider against the AGP
> >>> disablement is that very capable and not-that-old AGP hardware is
> >>> still out there. For example, the ATI Radeon HD 4670 AGP
> >>> (RV730 XT) was still sold brand new after 2010 and is a powerful and
> >>> featureful GPU with 1GB of VRAM and an HDMI port. Performance with it
> >>> is still pretty decent on competitive games. To compare with other
> >>> open source drivers mainlined in Linux, to outperform this GPU a
> >>> user has to get an Intel UHD 600 or an Nvidia GTX 1060 from 2016.
> >>>
> >>> Also, yet another thing that is important to consider against AGP
> >>> disablement is that, although PCI Express was introduced in 2004, AGP
> >>> compatible hardware was still being designed, produced and sold until
> >>> quite recently, especially on the AMD side. Computers with quad core
> >>> 64-bit CPUs with virtualisation, 16GB of RAM and AGP exist, and this is
> >>> widely distributed consumer hardware, not specific esoteric hardware.
> >>>
> >>> So, not only were powerful AGP GPUs still sold brand new in the
> >>> current decade, but there were also very capable computers to host
> >>> them. Because of those AGP computers, fixing PCI GPUs fallback is
> >>> not a solution because PCI fallback is not a solution.
> >>>
> >>
> >> For newer AGP hardware like the RV730 you point out (or anything newer
> >> than R300), there is no reason to run AGP mode.  The on chip GART is far
> >> superior.  The only chips where performance may be a problem is the older
> >> R1xx/R2xx radeons, and the issue there is more around the size of the TLB on
> >> the on chip GART vs the TLB in the AGP bridge. Also as Christian mentioned,
> >> AGP is PCI so if PCI doesn't work, you have bigger problems.
> >>
> >> Alex
> >>
> >>
> >>> All that range of hardware became unusable with that commit
> >>> disabling AGP, without alternative.
> >>>
> >>> Not only those AGP GPUs don't work with kernel's PCI fallback, but
> >>> unplugging those AGP GPUs and plugging physical PCI-native GPUs
> >>> instead does not work.
> >>>
> >>> You'll find more details about the various issues in those bug reports.
> >>> I've invested multiple full-time days testing and reproducing bugs on a
> >>> wide range of hardware, and I've attached, quoted and commented on a lot
> >>> of logs:
> >>>
> >>> - https://bugs.launchpad.net/bugs/1899304
> >>>> AGP disablement leaves GPUs without working alternative (PCI
> >>>> fallback is broken), makes very-capable ATI TeraScale GPUs unusable
> >>>
> >>> - https://bugs.launchpad.net/bugs/1902981
> >>>> AGP GPUs driven as PCI ones (when AGP is disabled at kernel build
> >>>> time) are known to fail on K8 and K10 platforms
> >>>
> >>> - https://bugs.launchpad.net/bugs/1902795
> >>>> PCI graphics broken on AMD K8/K10/Piledriver platform (while it
> >>>> works on Intel) verified from Linux 4.4 to 5.10-rc2
> >>>
> >>> I wish to be personally CC'ed the answers/comments posted to the
> >>> list in response to my posting.
> >>>
> >>> Thank you for your attention.
> >>>
> >>> --
> >>> Thomas “illwieckz” Debesse


* Re: On disabling AGP without working alternative (PCI and PCIe are also affected)
  2021-05-13 19:02       ` Deucher, Alexander
@ 2021-05-13 20:46         ` Thomas “illwieckz“ Debesse
  2021-05-13 21:18           ` Deucher, Alexander
  0 siblings, 1 reply; 8+ messages in thread
From: Thomas “illwieckz“ Debesse @ 2021-05-13 20:46 UTC (permalink / raw)
  To: Deucher, Alexander, Koenig, Christian, LKML

On 13/05/2021 at 21:02, Deucher, Alexander wrote:
> [AMD Public Use]
> 
> I don't think I have a functional AGP system anymore, but I do have PCIe capable systems and they work fine.  
> Does this patch[1], help by any chance?  The change to add support for root ports with addressing limitations seemed to break a lot of old systems,
> but never really got resolved.  If not, your best bet is probably to try and bisect if something broke your system(s).
> 
> Alex
> 
> [1] - https://www.spinics.net/lists/amd-gfx/msg52961.html

The more modern PCIe systems seem not to be affected when running PCIe
cards. I would not be surprised if modern PCIe hosts rely on features
that were supported in the past, and so the old features are not
really tested.

For example, while reading the Linux code in October I noticed the code
was referencing different mask lengths; what if only the implementation
for the newer length works, or something like that?

But well, the patch you linked touches the exact code that made me
wonder about it:

```
	dma_bits = 40;
	if (rdev->flags & RADEON_IS_AGP)
		dma_bits = 32;
	if ((rdev->flags & RADEON_IS_PCI) &&
	    (rdev->family <= CHIP_RS740))
		dma_bits = 32;
```

If I'm right this code sets this value to 40 by default, then sets it to
32 if GPU is AGP or if GPU is PCI and identifier is smaller or equal to
RS740.

I see no RADEON_IS_PCIE check here, so I assume both PCIe cards and AGP
cards running with radeon.agpmode=-1, with identifiers greater than RS740,
are probably keeping this value as 40.
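
To make that reading concrete, here is a minimal user-space sketch of the
same branch logic. The flag bit values and the shortened family enum are
stand-ins made up for illustration, not the kernel's real definitions; only
the position of each chip relative to CHIP_RS740 matters. pick_dma_bits()
and self_check() are hypothetical names.

```c
#include <assert.h>

/* Illustrative flag bits; the real values live in radeon_family.h. */
#define RADEON_IS_AGP (1u << 0)
#define RADEON_IS_PCI (1u << 1)

/* Abbreviated family list; only the order relative to CHIP_RS740 matters. */
enum radeon_family {
	CHIP_R100,
	CHIP_RS740,
	CHIP_RV730,
	CHIP_RV710,
};

/* Hypothetical helper mirroring the quoted snippet. */
static int pick_dma_bits(unsigned int flags, enum radeon_family family)
{
	int dma_bits = 40;

	if (flags & RADEON_IS_AGP)
		dma_bits = 32;
	if ((flags & RADEON_IS_PCI) && (family <= CHIP_RS740))
		dma_bits = 32;
	return dma_bits;
}

/* The cases discussed in this mail. */
static void self_check(void)
{
	assert(pick_dma_bits(RADEON_IS_AGP, CHIP_RV730) == 32); /* AGP card seen as AGP */
	assert(pick_dma_bits(RADEON_IS_PCI, CHIP_RV730) == 40); /* same chip seen as PCI */
	assert(pick_dma_bits(RADEON_IS_PCI, CHIP_RV710) == 40); /* PCI HD 4350 */
	assert(pick_dma_bits(RADEON_IS_PCI, CHIP_R100) == 32);  /* old PCI chip */
	assert(pick_dma_bits(0, CHIP_RV710) == 40);             /* neither flag set */
}
```

If this model is right, the only way a chip newer than RS740 ends up with
32 bits is by carrying the RADEON_IS_AGP flag.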

It's interesting to note that the PCI HD 4350 (RV710) will use 40 bits,
given it is after RS740 in drivers/gpu/drm/radeon/radeon_family.h.

If an AGP card is running with radeon.agpmode = -1, how is it reported,
RADEON_IS_AGP or RADEON_IS_PCI?

If RADEON_IS_PCI, the AGP Radeon HD4670 (RV730) will use 40 bits, given
it is after RS740 in drivers/gpu/drm/radeon/radeon_family.h

I remember having tried to force everything to 32 in that part of the
code, but I still got problems, just different ones.

From https://lkml.org/lkml/2020/11/9/1054:

> ## drm/radeon: make all PCI GPUs use 32 bits DMA bit mask
>
> https://lkml.org/lkml/2020/11/5/307
>
> This one is not enough to fix PCI GPUs, but it is enough to prevent
> r600_ring_test from failing on ATI PCI devices. Note that Nvidia PCI GPUs
> can't be fixed by this, and this uncovers another bug with AGP GPUs when
> AGP is disabled at build time. Also, this patch may make PCI GPUs
> work in a non-optimal way on platforms that accept them with a 40-bit
> DMA bit mask (like Intel-based computers that already work without any
> patch).

So I was wondering if there was a similar issue elsewhere in the code.

I see the patch at https://www.spinics.net/lists/amd-gfx/msg52961.html
also sets, in that code, another variable I haven't touched:
rdev->need_dma32

I'll try to set both dma_bits to 32 and rdev->need_dma32 unconditionally
and see if I notice a difference with this or that GPU.

Note that the issue with the PCI HD 4350 (RV710) does not need an AGP
host to be tested, only an AMD host (reproduced from K8 to Piledriver),
but unfortunately now the PCI variant of this card seems to be very hard
to find (I doubt the PCIe one is affected).

Thank you for your answer and your attention!

-- 
Thomas “illwieckz” Debesse
I wish to be personally CC'ed the answers/comments posted to the list in
response to my posting.


* RE: On disabling AGP without working alternative (PCI and PCIe are also affected)
  2021-05-13 20:46         ` Thomas “illwieckz“ Debesse
@ 2021-05-13 21:18           ` Deucher, Alexander
  0 siblings, 0 replies; 8+ messages in thread
From: Deucher, Alexander @ 2021-05-13 21:18 UTC (permalink / raw)
  To: Thomas “illwieckz“ Debesse, Koenig, Christian, LKML

[AMD Public Use]

> -----Original Message-----
> From: Thomas “illwieckz“ Debesse <dev@illwieckz.net>
> Sent: Thursday, May 13, 2021 4:46 PM
> To: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian
> <Christian.Koenig@amd.com>; LKML <linux-kernel@vger.kernel.org>
> Subject: Re: On disabling AGP without working alternative (PCI and PCIe are
> also affected)
> 
> On 13/05/2021 at 21:02, Deucher, Alexander wrote:
> > [AMD Public Use]
> >
> > I don't think I have a functional AGP system anymore, but I do have PCIe
> > capable systems and they work fine.
> > Does this patch[1], help by any chance?  The change to add support for
> > root ports with addressing limitations seemed to break a lot of old systems,
> > but never really got resolved.  If not, your best bet is probably to try and
> > bisect if something broke your system(s).
> >
> > Alex
> >
> > [1] - https://www.spinics.net/lists/amd-gfx/msg52961.html
> 
> The more modern PCIe systems seem not to be affected when running
> PCIe cards. I would not be surprised if modern PCIe hosts rely on features
> that were supported in the past, and so the old features are not really
> tested.
> 
> For example, while reading the Linux code in October I noticed the code was
> referencing different mask lengths; what if only the implementation for the
> newer length works, or something like that?
> 
> But well, the patch you linked touches the exact code that made me
> wonder about it:
> 
> ```
> 	dma_bits = 40;
> 	if (rdev->flags & RADEON_IS_AGP)
> 		dma_bits = 32;
> 	if ((rdev->flags & RADEON_IS_PCI) &&
> 	    (rdev->family <= CHIP_RS740))
> 		dma_bits = 32;
> ```
> 
> If I'm right this code sets this value to 40 by default, then sets it to
> 32 if GPU is AGP or if GPU is PCI and identifier is smaller or equal to RS740.
> 
> I see no RADEON_IS_PCIE check here, so I assume both PCIe cards and AGP
> cards running with radeon.agpmode=-1, with identifiers greater than RS740,
> are probably keeping this value as 40.
> 
> It's interesting to notice the PCI HD 4350 (RV710) will use 40 bits, given it is
> after RS740 in drivers/gpu/drm/radeon/radeon_family.h
> 
> If an AGP card is running with radeon.agpmode = -1, how is it reported,
> RADEON_IS_AGP or RADEON_IS_PCI?

It depends on the ASIC.  See radeon_agp_disable(), but it doesn't really
matter.  The driver doesn't really care; it's all PCI at the end of the day.
The only thing the driver really cares about is whether it will be using the
AGP remapper in the chipset for accessing system memory, or whether it will
be using its own built-in remapper on the GPU itself.

> 
> If RADEON_IS_PCI, the AGP Radeon HD4670 (RV730) will use 40 bits, given it
> is after RS740 in drivers/gpu/drm/radeon/radeon_family.h
> 
> I remember having tried to force everything to 32 in that part of the
> code, but I still got problems, just different ones.

The bits here refer to the addressing capabilities of the device, i.e. how
many address bits it can handle for DMA.  That is baked into the hardware.
Device drivers report the address limits of the device to the kernel so that
the DMA API will give them memory within the range of addresses they can access.
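
As a rough user-space illustration of what an n-bit limit means (a sketch,
not driver code): the macro below has the same shape as the kernel's
DMA_BIT_MASK(); device_can_reach() and mask_self_check() are hypothetical
helpers added only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same shape as the kernel's DMA_BIT_MASK() macro. */
#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/* Hypothetical helper: can a device with `bits` address bits reach `addr`? */
static bool device_can_reach(unsigned int bits, uint64_t addr)
{
	return addr <= DMA_BIT_MASK(bits);
}

static void mask_self_check(void)
{
	assert(DMA_BIT_MASK(32) == 0xffffffffULL);   /* highest address below 4 GiB */
	assert(DMA_BIT_MASK(40) == 0xffffffffffULL); /* highest address below 1 TiB */
	/* A buffer at 8 GiB is reachable with 40 bits but not with 32. */
	assert(device_can_reach(40, 8ULL << 30));
	assert(!device_can_reach(32, 8ULL << 30));
}
```

So a device that reports 32 bits should only ever be handed buffers below
4 GiB, while one that reports 40 bits may be handed buffers anywhere below
1 TiB.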

> 
> From https://lkml.org/lkml/2020/11/9/1054:
> 
> > ## drm/radeon: make all PCI GPUs use 32 bits DMA bit mask
> >
> > https://lkml.org/lkml/2020/11/5/307
> >
> > This one is not enough to fix PCI GPUs, but it is enough to prevent
> > r600_ring_test from failing on ATI PCI devices. Note that Nvidia PCI GPUs
> > can't be fixed by this, and this uncovers another bug with AGP GPUs when
> > AGP is disabled at build time. Also, this patch may make PCI GPUs
> > work in a non-optimal way on platforms that accept them with a 40-bit
> > DMA bit mask (like Intel-based computers that already work without any
> > patch).
> 
> So I was wondering if there was a similar issue elsewhere in the code.

Note that platforms can also impose limitations on DMA even if a device may
be more capable.  That is what Christoph's patch attempted to address.
The patch you proposed above is more or less a partial revert of the same
patch I referenced in my last reply.
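
A simplified way to picture the platform point, as a conceptual model only
and not the kernel's actual logic: the usable address width is the narrower
of what the device can drive and what the platform can route.
effective_dma_bits() and the widths in the checks are hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: the usable width is the narrower of the two limits. */
static unsigned int effective_dma_bits(unsigned int device_bits,
				       unsigned int platform_bits)
{
	return device_bits < platform_bits ? device_bits : platform_bits;
}

static void platform_self_check(void)
{
	assert(effective_dma_bits(40, 32) == 32); /* capable device, limited platform */
	assert(effective_dma_bits(40, 64) == 40); /* device is the limiting side */
	assert(effective_dma_bits(32, 40) == 32); /* 32-bit device stays at 32 */
}
```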

Alex

> 
> I see the patch at https://www.spinics.net/lists/amd-gfx/msg52961.html
> is also setting in that code another variable I haven't touched:
> rdev->need_dma32
> 
> I'll try to set both dma_bits to 32 and rdev->need_dma32 unconditionally and
> see if I notice a difference with this or that GPU.
> 
> Note that the issue with the PCI HD 4350 (RV710) does not need an AGP host
> to be tested, only an AMD host (reproduced from K8 to Piledriver), but
> unfortunately now the PCI variant of this card seems to be very hard to find
> (I doubt the PCIe one is affected).
> 
> Thank you for your answer and your attention!
> 
> --
> Thomas “illwieckz” Debesse
> I wish to be personally CC'ed the answers/comments posted to the list in
> response to my posting.


Thread overview: 8+ messages
2020-11-09 11:40 On disabling AGP without working alternative (PCI fallback is broken for years) Thomas “illwieckz“ Debesse
2020-11-09 13:57 ` Christian König
2020-11-09 17:37 ` Deucher, Alexander
2021-05-06  5:37   ` On disabling AGP without working alternative (PCI and PCIe are also affected) Thomas “illwieckz“ Debesse
2021-05-13 18:42     ` Thomas “illwieckz“ Debesse
2021-05-13 19:02       ` Deucher, Alexander
2021-05-13 20:46         ` Thomas “illwieckz“ Debesse
2021-05-13 21:18           ` Deucher, Alexander
