On Fri, Dec 6, 2019 at 5:08 PM Bjorn Helgaas wrote:
>
> On Fri, Dec 06, 2019 at 08:09:48AM +0200, Ranran wrote:
> > On Fri, Nov 29, 2019 at 8:38 PM Bjorn Helgaas wrote:
> > >
> > > On Fri, Nov 29, 2019 at 06:10:51PM +0200, Ranran wrote:
> > > > On Fri, Nov 29, 2019 at 4:58 PM Bjorn Helgaas wrote:
> > > > > On Fri, Nov 29, 2019 at 06:59:48AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> > > > > > https://bugzilla.kernel.org/show_bug.cgi?id=205701
>
> > I have tried to upgrade to latest kernel 5.4 (elrepo in centos), but
> > with this processor/board (system x3650, Xeon), it get hang during
> > kernel boot, without any error in dmesg, just keeps waiting for
> > nothing for couple of minutes and than drops to dracut.
>
>   - I don't think you ever said exactly what the original failure mode
>     was.  You said DMA from an FPGA failed.  What is the specific
>     device?  How do you know the DMA fails?
>

Hi,

The FPGA is an Intel Arria 10 device.
We know that the DMA fails because, when probing the DMA transaction
from the FPGA to the CPU's RAM with SignalTap, we see that it stalls,
i.e. it keeps waiting for the access to finish.
We don't observe any error in dmesg.

>   - Re your v5.4 kernel testing, dracut is a user-space distro thing, so
>     it sounds like your hang is some sort of installation problem that I
>     can't really help you with.  Maybe there are troubleshooting hints
>     at https://www.kernel.org/pub/linux/utils/boot/dracut/dracut.html.

I know, that's quite frustrating. I tried to disable features using the
kernel arguments noacpi, noapic, but it still freezes somewhere without
giving any error.

>     You may also be able to just drop a v5.4 kernel on your v4.18
>     system, at least for testing purposes.
>

What does it mean to drop a 5.4 kernel on a 4.18 system?

>   - Your comment #3 in bugzilla is a link to a Google Doc containing a
>     test module.  In the future, please attach things as plain text
>     attachments directly to the bugzilla.  There's an "Add attachment"
>     link immediately before the "Description" comment in bugzilla.  I
>     did it for you this time.
>
>   - It looks like your test_module.c is a kernel module, and frankly
>     it's a mess.  Global variables that should be per-device, unused
>     variables (dma_get_mask() called for no reason), confused usage
>     (e.g., using both pci_dev_s and pPciDev), whitespace that appears
>     random, etc.  I suggest starting with Documentation/PCI/pci.rst and,
>     at least for this debugging effort, making it a self-contained
>     driver instead of splitting things between a kernel module and
>     user-space.
>

I've attached the latest kernel module, which I hope will make it
clearer. I will try to make it a standalone test next time I'm in the
lab.

>   - Your comment #4 is a link to a Google Doc containing lspci output.
>     I attached it to bugzilla directly for you.
>
>   - You apparently didn't run lspci as root ("sudo lspci -vv"), so it
>     is missing a lot of information.
>
>   - Your lspci doesn't match either of the dmesg logs.  Please make sure
>     all your logs are from the same machine in the same configuration.
>     For example, the first devices found by the kernel (from both
>     comments #1 and #2) are:
>
>       pci 0000:00:00.0: [8086:3c00] type 00 class 0x060000
>       pci 0000:00:01.0: [8086:3c02] type 01 class 0x060400
>       pci 0000:00:02.0: [8086:3c04] type 01 class 0x060400
>       pci 0000:00:02.2: [8086:3c06] type 01 class 0x060400
>       ...
>
>     But the lspci doesn't include 00:01.0, 00:02.0, or 00:02.2.  It
>     shows:
>
>       00:00.0 Host bridge: Intel Corporation Device 2020 (rev 04)
>       00:04.0 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers (rev 04)
>       00:04.1 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers (rev 04)
>       00:04.2 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers (rev 04)
>       ...

I will do it in the lab tomorrow.

Thanks.
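
P.S. To check that the dmesg and lspci logs come from the same machine before attaching them, something like the following rough sketch could help. It is only an illustration: the function names and the two regular expressions are my own assumptions about the line formats quoted above, not an existing tool, and the sample strings are just the lines quoted in this thread.

```python
import re

# Assumed format of kernel enumeration lines, e.g.
#   pci 0000:00:01.0: [8086:3c02] type 01 class 0x060400
DMESG_RE = re.compile(
    r"pci (?:[0-9a-f]{4}:)?([0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"\[[0-9a-f]{4}:[0-9a-f]{4}\]"
)

# Assumed format of the first line of each lspci entry, e.g.
#   00:04.0 System peripheral: ...
LSPCI_RE = re.compile(
    r"^(?:[0-9a-f]{4}:)?([0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) ", re.MULTILINE
)


def dmesg_devices(text):
    """Return the set of bus:dev.fn addresses enumerated in a dmesg log."""
    return set(DMESG_RE.findall(text))


def lspci_devices(text):
    """Return the set of bus:dev.fn addresses listed in lspci output."""
    return set(LSPCI_RE.findall(text))


def compare(dmesg_text, lspci_text):
    """Return (only_in_dmesg, only_in_lspci) as sorted lists."""
    d = dmesg_devices(dmesg_text)
    l = lspci_devices(lspci_text)
    return sorted(d - l), sorted(l - d)


# Sample input: the lines quoted in this thread.
DMESG_SAMPLE = """\
pci 0000:00:00.0: [8086:3c00] type 00 class 0x060000
pci 0000:00:01.0: [8086:3c02] type 01 class 0x060400
pci 0000:00:02.0: [8086:3c04] type 01 class 0x060400
pci 0000:00:02.2: [8086:3c06] type 01 class 0x060400
"""

LSPCI_SAMPLE = """\
00:00.0 Host bridge: Intel Corporation Device 2020 (rev 04)
00:04.0 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers (rev 04)
00:04.1 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers (rev 04)
00:04.2 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers (rev 04)
"""

if __name__ == "__main__":
    only_dmesg, only_lspci = compare(DMESG_SAMPLE, LSPCI_SAMPLE)
    print("in dmesg but not in lspci:", only_dmesg)
    print("in lspci but not in dmesg:", only_lspci)
```

If both lists come back empty for the full logs, the two files at least list the same device addresses. A non-empty list, as with the excerpts above, suggests the logs were taken on different machines or in different configurations.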