All of lore.kernel.org
 help / color / mirror / Atom feed
* IP27: debugging RCU stalls under newer code
@ 2017-10-10  0:39 Joshua Kinard
  2017-10-16  5:32 ` Joshua Kinard
  0 siblings, 1 reply; 2+ messages in thread
From: Joshua Kinard @ 2017-10-10  0:39 UTC (permalink / raw)
  To: Linux/MIPS

[-- Attachment #1: Type: text/plain, Size: 7996 bytes --]

Hi,

I am still trying to chase down bugs in the modified IP27 patchset I've been
maintaining.  Getting a rather consistent RCU hang the last few kernel
versions, but I have not had much luck figuring out why it's hanging.  I
suspect the re-written interrupt code I did has a bug in it somewhere.

Here's the most common symptom on the Origin 200 machine that I have:

[    5.093177] NET: Registered protocol family 1
[    5.145708] RPC: Registered named UNIX socket transport module.
[    5.216044] RPC: Registered udp transport module.
[    5.272604] RPC: Registered tcp transport module.
[    5.329150] RPC: Registered tcp NFSv4.1 backchannel transport module.
[  128.087259] INFO: rcu_sched detected stalls on CPUs/tasks:
[  128.151416]  (detected by 1, t=11887 jiffies, g=-261, c=-262, q=3)
[  128.225797] All QSes seen, last rcu_sched kthread activity 11885 (4294949888-4294938003), jiffies_till_next_fqs=1, root ->qsmask 0x0
[  128.369292] rcu_sched       R  running task        0     8      2 0x00100000
[  128.411349] workingset: timestamp_bits=40 max_order=15 bucket_order=0
[  128.531651] Stack : a800000000c2fe28 000000000000000a 000000000000000a a800000000c30232
[  128.628013]         000000000000001d a800000000c30000 0000000000000000 0000000000000000
[  128.724381]         0000000000000005 0000000000000000 000000000000000a a80000000008fe3c
[  128.820749]         ffffffff94005ce0 0000000000000000 0000000000000000 a800000070195480
[  128.917116]         a8000000007c0000 a8000000007c04f0 a8000000007c0000 0000000000000000
[  129.013484]         a8000000007c0000 fffffffffffffefb 0000000000000000 0000000000000000
[  129.109852]         0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  129.206220]         0000000000000000 a8000000700cfc20 0000000000000000 a8000000000a227c
[  129.302587]         0000000000000000 0000000000000000 a800000070195480 a800000000072c00
[  129.398954]         00000000ffff8d93 a800000000029350 a8000000007cf700 a8000000000a227c
[  129.495322]         ...
[  129.524655] Call Trace:
[  129.554030] [<a800000000029350>] show_stack+0x70/0x9c
[  129.614757] [<a8000000000a227c>] rcu_check_callbacks+0x85c/0x944
[  129.687036] [<a8000000000a6644>] update_process_times+0x34/0x70
[  129.758252] [<a8000000000b636c>] tick_sched_timer+0x3c/0x90
[  129.825291] [<a8000000000a6c6c>] __hrtimer_run_queues+0x128/0x1c4
[  129.898615] [<a8000000000a7654>] hrtimer_interrupt+0xf4/0x274
[  129.967747] [<a80000000001e328>] ip27_hub_rt_compare_irq+0xa8/0xb8
[  130.042120] [<a800000000092d7c>] __handle_irq_event_percpu+0xe0/0x1a4
[  130.119630] [<a800000000092e60>] handle_irq_event_percpu+0x20/0x7c
[  130.194007] [<a800000000097c34>] handle_percpu_irq+0xa8/0xd8
[  130.262085] [<a800000000092198>] generic_handle_irq+0x28/0x38
[  130.331218] [<a8000000000257f0>] do_IRQ+0x18/0x28
[  130.387781] [<a8000000000237d0>] handle_int+0x160/0x16c
[  130.450644] [<a8000000006294a8>] arch_local_irq_restore+0x18/0x20
[  130.523964] [<a80000000062f8ac>] _raw_spin_unlock_irqrestore+0x2c/0x38
[  130.602514] [<a8000000000a0418>] force_qs_rnp+0x12c/0x1e0
[  130.667458] [<a8000000000a0dc0>] rcu_gp_kthread+0x8f4/0xa04
[  130.734511] [<a800000000067b20>] kthread+0x178/0x180
[  130.794204] [<a800000000023408>] ret_from_kernel_thread+0x14/0x1c
[  130.867538] rcu_sched kthread starved for 11885 jiffies! g18446744073709551355 c18446744073709551354 f0x2 RCU_GP_DOING_FQS(4) ->state=0x0
[  131.016256] rcu_sched       R  running task        0     8      2 0x00100000
[  131.101100] Stack : a800000000c2fe28 000000000000000a 000000000000000a a800000000c30232
[  131.197468]         000000000000001d a800000000c30000 0000000000000000 0000000000000000
[  131.293835]         0000000000000005 0000000000000000 000000000000000a a80000000008fe3c
[  131.390203]         ffffffff94005ce0 0000000000000000 0000000000000000 a800000070195480
[  131.486570]         a8000000007c0000 a8000000007c04f0 a8000000007c0000 0000000000000000
[  131.582937]         a8000000007c0000 fffffffffffffefb 0000000000000000 0000000000000000
[  131.679305]         0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  131.775673]         0000000000000000 a8000000700cfc00 0000000000000000 a8000000000a2c3c
[  131.872040]         0000000000000000 0000000000000000 a800000070195480 a800000000072c00
[  131.968408]         fffffffffffffefa a800000000029350 a8000000007cf700 a8000000000a2c3c
[  132.064776]         ...
[  132.094106] Call Trace:
[  132.123446] [<a800000000029350>] show_stack+0x70/0x9c
[  132.184200] [<a8000000000a2c3c>] rcu_check_gp_kthread_starvation+0xdc/0x104
[  132.267997] [<a8000000000a2110>] rcu_check_callbacks+0x6f0/0x944
[  132.340276] [<a8000000000a6644>] update_process_times+0x34/0x70
[  132.411500] [<a8000000000b636c>] tick_sched_timer+0x3c/0x90
[  132.478541] [<a8000000000a6c6c>] __hrtimer_run_queues+0x128/0x1c4
[  132.551865] [<a8000000000a7654>] hrtimer_interrupt+0xf4/0x274
[  132.620996] [<a80000000001e328>] ip27_hub_rt_compare_irq+0xa8/0xb8
[  132.695367] [<a800000000092d7c>] __handle_irq_event_percpu+0xe0/0x1a4
[  132.772881] [<a800000000092e60>] handle_irq_event_percpu+0x20/0x7c
[  132.847252] [<a800000000097c34>] handle_percpu_irq+0xa8/0xd8
[  132.915336] [<a800000000092198>] generic_handle_irq+0x28/0x38
[  132.984469] [<a8000000000257f0>] do_IRQ+0x18/0x28
[  133.041033] [<a8000000000237d0>] handle_int+0x160/0x16c
[  133.103882] [<a8000000006294a8>] arch_local_irq_restore+0x18/0x20
[  133.177208] [<a80000000062f8ac>] _raw_spin_unlock_irqrestore+0x2c/0x38
[  133.255764] [<a8000000000a0418>] force_qs_rnp+0x12c/0x1e0
[  133.320709] [<a8000000000a0dc0>] rcu_gp_kthread+0x8f4/0xa04
[  133.387749] [<a800000000067b20>] kthread+0x178/0x180
[  133.447455] [<a800000000023408>] ret_from_kernel_thread+0x14/0x1c
[  133.521008] zbud: loaded

What's the best way to try and debug this?  I've done some research on RCU
debugging, but that's quite a rabbit hole, so I need to narrow down the
possible causes before figuring out which hole to venture down.  The
interesting observation in almost all test cases going back to ~4.11 is
the stall for almost 2-3 minutes before the RCU core "kicked" things to get
moving again.

It'll sometimes stall again, especially when probing PCI.  If you wait long
enough, though, the machine eventually resumes.  E.g., from this same dmesg,
I had an even longer stall at the end of kernel init, but when it resumed,
there was no complaining from the RCU subsystem:

[  173.755793] sd 0:0:4:0: [sdd] Attached SCSI disk
[  174.389121] sd 0:0:6:0: Attached scsi generic sg5 type 0
[  174.390132] sd 0:0:6:0: [sdf] 35563040 512-byte logical blocks: (18.2 GB/17.0 GiB)
[  174.391646] sd 0:0:6:0: [sdf] Write Protect is off
[  174.394708] sd 0:0:6:0: [sdf] Write cache: disabled, read cache: enabled, supports DPO and FUA
[  174.423225]  sdf: sdf1 sdf2 sdf9 sdf11
[  174.431433] sd 0:0:6:0: [sdf] Attached SCSI disk
[ 1942.572077] st 1:0:2:0: Attached scsi tape st0
[ 1942.625796] st 1:0:2:0: st0: try direct i/o: yes (alignment 4 B)
[ 1942.699105] st 1:0:2:0: Attached scsi generic sg6 type 1
[ 1942.766034] sr 1:0:6:0: [sr0] scsi-1 drive
[ 1942.815484] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 1942.879869] sr 1:0:6:0: Attached scsi generic sg7 type 5
[ 1942.967801] Freeing unused kernel memory: 3840K
[ 1943.022480] This architecture does not have kernel memory protection.

That's a stall of almost 29 minutes, then it suddenly woke up and finished
booting into the netboot image I have for debugging.  I'd think under normal
circumstances, the machine should have panicked long before then.  Any ideas
to try would be appreciated.  A full bootlog is attached.

Thanks!,

-- 
Joshua Kinard
Gentoo/MIPS
kumba@gentoo.org
6144R/F5C6C943 2015-04-27
177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943

"The past tempts us, the present confuses us, the future frightens us.  And
our lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic

[-- Attachment #2: ip27-boot-4135-20171009.txt --]
[-- Type: text/plain, Size: 26885 bytes --]

IP27 PROM SGI Version 6.156  built 11:27:56 AM Nov 18, 2003
Local master CPU A revision: e35
PROM length: 0xdddd0, BSS length: 0xb772, flash count: 7
Configured hub clock: 90.0 MHz
Status of Hub IO: 0x1 0x3fc03ff2c0a
Hub Rev: 5, Module: 1 from MSC, Slot: 1
On PROM entry: ERR_EPC=0xffffffffbfc00ee0 (0xc00000001fc00ee0)
Configuring memory
Bank 0: 512 MB (standard)
Bank 1: 512 MB (standard)
Bank 2: 512 MB (standard)
Bank 3: 512 MB (standard)
Local memory configured: 2048 MB (standard)
*** Diag level set to None (2)
*** Info level set to verbose
*** Hub NIC broken; using BaseIO nic
Hub NIC: 0x5242cf
SR1 set to 0x0100019c06010000
SR0 set to 0x00000000005242cf
Testing/Initializing memory ...............             DONE
Copying PROM code to memory ...............             Init PROM text/data (0x9600000001a00000), len 0xde000
        Initializing dir/prot
        Clearing memory
Copy PROM (0x9000000010000000) to RAM (0x9600000001a00000), len 0xdddd0
Done
DONE
CPU A switching to UALIAS
CPU A now running out of UALIAS
Init PROM bss (0x9600000001ade000), len 0xc000
        Initializing dir/prot
        Clearing memory
Init PROM stack/structures (0x9600000001bd0000), len 0x30000
        Initializing dir/prot
        Clearing memory
Done
Skipping secondary cache diags
CPU A switching stack into UALIAS and invalidating D-cache
CPU A switching into node 0 cached RAM
CPU A running cached
Init bank 0 before text (0xa800000000000000), len 0x1a00000
        Initializing dir/prot
        Clearing memory
Init bank 0 end of BSS to structs (0xa800000001aea000), len 0xe6000
        Initializing dir/prot
        Clearing memory
Init bank 0 end of structs to 32 MB (0xa800000001c00000), len 0x400000
        Initializing dir/prot
        Clearing memory
Initializing kldir.
Done initializing kldir.
Initializing klconfig.
init_klcfg: nasid 0 start 9600000000004000 size c000
Done initializing klconfig.
Discovering local IO ......................             *** Local Master IO6 is on widget 0x8
DONE
Discovering NUMAlink connectivity .........
Local hub NUMAlink is down.
*** Local network link down
DONE
Found 1 objects (1 hubs, 0 routers) in 30376 usec
Waiting for peers to complete discovery....             DONE
No other nodes present; becoming global master
Global master is entry 0, NIC 0x5242cf, /hw/module/1/slot/MotherBoard
Global master is /hw/module/1/slot/MotherBoard
Global barrier (line 4418)Global barrier passed.
Global barrier (line 4446)Global barrier passed.
Master System Topology Graph (pre-nasid_assign):
ENTRY 0: HUB(5242cf)
    NASID=-1 Mod=1 Slot=1 Flg=0x100000 PROM=6.156 Route=N/A
    MODULE=1 PARTITION=0 SPACE=RESET
    Port 1 connection: Not connected
    Port status: NF
Calculating NASIDs
num_routers is 0
Master System Topology Graph:
ENTRY 0: HUB(5242cf)
    NASID=0 Mod=1 Slot=1 Flg=0x100000 PROM=6.156 Route=N/A
    MODULE=1 PARTITION=0 SPACE=RESET
    Port 1 connection: Not connected
    Port status: NF
Distributing routing tables
Distributing NASIDs
*** NASID assigned to 0
CPU A switching to UALIAS
CPU A running in UALIAS
CPU A Flushing and invalidating caches
Changing node ID to 0
Global barrier (line 4908)Global barrier passed.
Global barrier (line 4981)Global barrier passed.
CPU A switching to node 0 cached RAM
CPU A running cached
Testing/Initializing all memory ....Init bank 0 after 32 MB (0xa800000002000000), len 0x1e000000
        Initializing dir/prot
        Clearing memory
.Init bank 1 (0xa800000020000000), len 0x20000000
        Initializing dir/prot
        Clearing memory
.Init bank 2 (0xa800000040000000), len 0x20000000
        Initializing dir/prot
        Clearing memory
.Init bank 3 (0xa800000060000000), len 0x20000000
        Initializing dir/prot
        Clearing memory
....            DONE
Global barrier (line 5108)Global barrier passed.
Checking partitioning information .........             DONE
No other nodes present; becoming partition master
*** After partitioning ***
ENTRY 0: HUB(5242cf)
    NASID=0 Mod=1 Slot=1 Flg=0x100000 PROM=6.156 Route=N/A
    MODULE=1 PARTITION=0 SPACE=RESET
    Port 1 connection: Not connected
    Port status: NF
Erecting partition fences ................                        DONE
Update config for routers connected to hubs
Update config for hubs and hubless routers
CPU A flushing cache
Entering ust_change
Global barrier (line 5355)Global barrier passed.
Loading BASEIO prom .......................             Uncompressing segment BASEIO prom
Jumping to entry point 0xa800000001c00140

DONE

BASEIO PROM Monitor SGI Version 6.156  built 11:05:48 AM Nov 18, 2003 (BE64)
2 CPUs on 1 nodes found.
Installing PROM Device drivers ............
Base I/O Ethernet set to /dev/ethernet/ef0

Walking SCSI Adapter 0 (/hw/module/1/slot/MotherBoard), (pci id 0)
1+ 2+ 3+ 4+ 5+ 6+ 7- 8- 9- 10- 11- 12- 13- 14- 15- = 6 device(s)


Walking SCSI Adapter 1 (/hw/module/1/slot/MotherBoard), (pci id 1)
1- 2+ 3- 4- 5- 6+ 7- = 2 device(s)

Initializing PROM Device drivers ..........             DONE
Checking hardware inventory ...............              DONE

**** System Configuration and Diagnostics Summary ****
CONFIG:
         No. of NODEs enabled    = 1
         No. of NODEs disabled   = 0
         No. of CPUs enabled     = 2
         No. of CPUs disabled    = 0
         Mem enabled             = 2048 MB
         Mem disabled            = 0 MB
         No. of RTRs enabled     = 0
         No. of RTRs disabled    = 0

DIAG RESULTS:
         NO DIAGS WERE RUN!
**** End System Configuration and Diagnostics Summary ****


System Maintenance Menu

1) Start System
2) Install System Software
3) Run Diagnostics
4) Recover System
5) Enter Command Monitor

Option? 5
Command Monitor.  Type "exit" to return to the menu.
>> bootp(): console=ttyS0,9600
Obtaining  from server
12446880+856864 entry: 0xa80000000062a4e0
[    0.000000] Linux version 4.13.5-mipsgit-20171007 (root@helcaraxe) (gcc version 7.2.0 (Gentoo 7.2.0 p1.1)) #2 SMP Mon Oct 9 17:33:11 EDT 2017
[    0.000000] ARCH: SGI-IP27
[    0.000000] PROMLIB: ARC firmware Version 64 Revision 0
[    0.000000] SMP: Discovered 2 cpus on 1 nodes
[    0.000000] ************** Topology ********************
[    0.000000]     00
[    0.000000] 00  255
[    0.000000] bootconsole [early0] enabled
[    0.000000] CPU0 revision is: 00000e35 (R12000)
[    0.000000] FPU revision is: 00000900
[    0.000000] Checking for the multiply/shift bug... no.
[    0.000000] Checking for the daddiu bug... no.
[    0.000000] IP27: Running on node 0.
[    0.000000] IP27: Node 0 has a primary CPU, CPU is running.
[    0.000000] IP27: Node 0 has a secondary CPU, CPU is running.
[    0.000000] IP27: Machine is in M mode.
[    0.000000] IP27: CPU 1A (CPU0): 360 MHz CPU detected.
[    0.000000] Determined physical RAM map:
[    0.000000]  memory: 0000000000830000 @ 0000000000010000 (usable)
[    0.000000]  memory: 00000000003c0000 @ 0000000000840000 (usable after init)
[    0.000000] Initrd not found or empty - disabling initrd
[    0.000000] cma: Reserved 128 MiB at 0x0000000078000000
[    0.000000] REPLICATION: ON nasid 0, ktext from nasid 0, kdata from nasid 0
[    0.000000] Primary instruction cache 32kB, VIPT, 2-way, linesize 64 bytes.
[    0.000000] Primary data cache 32kB, 2-way, VIPT, no aliases, linesize 32 bytes
[    0.000000] Unified secondary cache 4096kB 2-way, linesize 128 bytes.
[    0.000000] Zone ranges:
[    0.000000]   Normal   [mem 0x0000000000000000-0x000000007fffffff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000000000000-0x000000007fffffff]
[    0.000000] Initmem setup node 0 [mem 0x0000000000000000-0x000000007fffffff]
[    0.000000] percpu: Embedded 2 pages/cpu @a800000001230000 s44192 r0 d86880 u131072
[    0.000000] Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 32736
[    0.000000] Kernel command line: console=ttyS0,9600
[    0.000000] PID hash table entries: 4096 (order: -1, 32768 bytes)
[    0.000000] Dentry cache hash table entries: 262144 (order: 5, 2097152 bytes)
[    0.000000] Inode-cache hash table entries: 131072 (order: 4, 1048576 bytes)
[    0.000000] Memory: 1947200K/2097152K available (6232K kernel code, 713K rwdata, 1368K rodata, 3840K init, 815K bss, 18880K reserved, 131072K cma-reserved)
[    0.000000] random: get_random_u32 called from cache_alloc_refill+0x700/0xc94 with crng_init=0
[    0.000000] Hierarchical RCU implementation.
[    0.000000]  RCU restricting CPUs from NR_CPUS=64 to nr_cpu_ids=2.
[    0.000000] RCU: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=2
[    0.000000] NR_IRQS: 256
[    0.000000] clocksource: HUB: mask: 0xfffffffffffff max_cycles: 0x127350b88, max_idle_ns: 1763180808480 ns
[    0.000024] sched_clock: 52 bits at 1250kHz, resolution 800ns, wraps every 4398046510800ns
[    0.102954] Console: colour dummy device 80x25
[    0.154959] Calibrating delay loop... 537.39 BogoMIPS (lpj=2686976)
[    0.288373] pid_max: default: 32768 minimum: 301
[    0.348906] Mount-cache hash table entries: 8192 (order: 0, 65536 bytes)
[    0.428314] Mountpoint-cache hash table entries: 8192 (order: 0, 65536 bytes)
[    0.520711] Checking for the daddi bug... no.
[    0.575051] Hierarchical SRCU implementation.
[    0.627084] smp: Bringing up secondary CPUs ...
[    0.681753] Primary instruction cache 32kB, VIPT, 2-way, linesize 64 bytes.
[    0.681775] Primary data cache 32kB, 2-way, VIPT, no aliases, linesize 32 bytes
[    0.681788] Unified secondary cache 4096kB 2-way, linesize 128 bytes.
[    0.683875] IP27: CPU 1B (CPU1): 360 MHz CPU detected.
[    0.683905] CPU1 revision is: 00000e35 (R12000)
[    0.683913] FPU revision is: 00000900
[    0.740838] smp: Brought up 1 node, 2 CPUs
[    1.199617] devtmpfs: initialized
[    1.238754] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604462750000 ns
[    1.355983] futex hash table entries: 512 (order: 0, 65536 bytes)
[    1.431528] xor: measuring software checksum speed
[    1.587392]    8regs     :   518.400 MB/sec
[    1.735959]    8regs_prefetch:   492.800 MB/sec
[    1.888873]    32regs    :   505.600 MB/sec
[    2.037558]    32regs_prefetch:   505.600 MB/sec
[    2.091191] xor: using function: 8regs (518.400 MB/sec)
[    2.154855] NET: Registered protocol family 16
[    2.210917] cpuidle: using governor ladder
[    2.300934] cpuidle: using governor menu
[    2.346289] xtalk:n0/8 bridge widget (rev D) registered as a platform device.
[    2.691408] raid6: int64x1  gen()   247 MB/s
[    2.911346] raid6: int64x1  xor()    89 MB/s
[    3.071463] random: fast init done
[    3.131280] raid6: int64x2  gen()   315 MB/s
[    3.351292] raid6: int64x2  xor()   111 MB/s
[    3.571175] raid6: int64x4  gen()   305 MB/s
[    3.791108] raid6: int64x4  xor()   104 MB/s
[    4.011250] raid6: int64x8  gen()   212 MB/s
[    4.231037] raid6: int64x8  xor()    89 MB/s
[    4.280515] raid6: using algorithm int64x2 gen() 315 MB/s
[    4.345456] raid6: .... xor() 111 MB/s, rmw enabled
[    4.404115] raid6: using intx1 recovery algorithm
[    4.461166] SCSI subsystem initialized
[    4.507276] clocksource: Switched to clocksource HUB
[    4.566548] FS-Cache: Loaded
[    4.600344] CacheFiles: Loaded
[    4.647052] NET: Registered protocol family 2
[    4.698335] TCP established hash table entries: 16384 (order: 1, 131072 bytes)
[    4.785248] TCP bind hash table entries: 16384 (order: 2, 262144 bytes)
[    4.865407] TCP: Hash tables configured (established 16384 bind 16384)
[    4.942916] UDP hash table entries: 2048 (order: 0, 65536 bytes)
[    5.015264] UDP-Lite hash table entries: 2048 (order: 0, 65536 bytes)
[    5.093177] NET: Registered protocol family 1
[    5.145708] RPC: Registered named UNIX socket transport module.
[    5.216044] RPC: Registered udp transport module.
[    5.272604] RPC: Registered tcp transport module.
[    5.329150] RPC: Registered tcp NFSv4.1 backchannel transport module.
[  128.087259] INFO: rcu_sched detected stalls on CPUs/tasks:
[  128.151416]  (detected by 1, t=11887 jiffies, g=-261, c=-262, q=3)
[  128.225797] All QSes seen, last rcu_sched kthread activity 11885 (4294949888-4294938003), jiffies_till_next_fqs=1, root ->qsmask 0x0
[  128.369292] rcu_sched       R  running task        0     8      2 0x00100000
[  128.411349] workingset: timestamp_bits=40 max_order=15 bucket_order=0
[  128.531651] Stack : a800000000c2fe28 000000000000000a 000000000000000a a800000000c30232
[  128.628013]         000000000000001d a800000000c30000 0000000000000000 0000000000000000
[  128.724381]         0000000000000005 0000000000000000 000000000000000a a80000000008fe3c
[  128.820749]         ffffffff94005ce0 0000000000000000 0000000000000000 a800000070195480
[  128.917116]         a8000000007c0000 a8000000007c04f0 a8000000007c0000 0000000000000000
[  129.013484]         a8000000007c0000 fffffffffffffefb 0000000000000000 0000000000000000
[  129.109852]         0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  129.206220]         0000000000000000 a8000000700cfc20 0000000000000000 a8000000000a227c
[  129.302587]         0000000000000000 0000000000000000 a800000070195480 a800000000072c00
[  129.398954]         00000000ffff8d93 a800000000029350 a8000000007cf700 a8000000000a227c
[  129.495322]         ...
[  129.524655] Call Trace:
[  129.554030] [<a800000000029350>] show_stack+0x70/0x9c
[  129.614757] [<a8000000000a227c>] rcu_check_callbacks+0x85c/0x944
[  129.687036] [<a8000000000a6644>] update_process_times+0x34/0x70
[  129.758252] [<a8000000000b636c>] tick_sched_timer+0x3c/0x90
[  129.825291] [<a8000000000a6c6c>] __hrtimer_run_queues+0x128/0x1c4
[  129.898615] [<a8000000000a7654>] hrtimer_interrupt+0xf4/0x274
[  129.967747] [<a80000000001e328>] ip27_hub_rt_compare_irq+0xa8/0xb8
[  130.042120] [<a800000000092d7c>] __handle_irq_event_percpu+0xe0/0x1a4
[  130.119630] [<a800000000092e60>] handle_irq_event_percpu+0x20/0x7c
[  130.194007] [<a800000000097c34>] handle_percpu_irq+0xa8/0xd8
[  130.262085] [<a800000000092198>] generic_handle_irq+0x28/0x38
[  130.331218] [<a8000000000257f0>] do_IRQ+0x18/0x28
[  130.387781] [<a8000000000237d0>] handle_int+0x160/0x16c
[  130.450644] [<a8000000006294a8>] arch_local_irq_restore+0x18/0x20
[  130.523964] [<a80000000062f8ac>] _raw_spin_unlock_irqrestore+0x2c/0x38
[  130.602514] [<a8000000000a0418>] force_qs_rnp+0x12c/0x1e0
[  130.667458] [<a8000000000a0dc0>] rcu_gp_kthread+0x8f4/0xa04
[  130.734511] [<a800000000067b20>] kthread+0x178/0x180
[  130.794204] [<a800000000023408>] ret_from_kernel_thread+0x14/0x1c
[  130.867538] rcu_sched kthread starved for 11885 jiffies! g18446744073709551355 c18446744073709551354 f0x2 RCU_GP_DOING_FQS(4) ->state=0x0
[  131.016256] rcu_sched       R  running task        0     8      2 0x00100000
[  131.101100] Stack : a800000000c2fe28 000000000000000a 000000000000000a a800000000c30232
[  131.197468]         000000000000001d a800000000c30000 0000000000000000 0000000000000000
[  131.293835]         0000000000000005 0000000000000000 000000000000000a a80000000008fe3c
[  131.390203]         ffffffff94005ce0 0000000000000000 0000000000000000 a800000070195480
[  131.486570]         a8000000007c0000 a8000000007c04f0 a8000000007c0000 0000000000000000
[  131.582937]         a8000000007c0000 fffffffffffffefb 0000000000000000 0000000000000000
[  131.679305]         0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  131.775673]         0000000000000000 a8000000700cfc00 0000000000000000 a8000000000a2c3c
[  131.872040]         0000000000000000 0000000000000000 a800000070195480 a800000000072c00
[  131.968408]         fffffffffffffefa a800000000029350 a8000000007cf700 a8000000000a2c3c
[  132.064776]         ...
[  132.094106] Call Trace:
[  132.123446] [<a800000000029350>] show_stack+0x70/0x9c
[  132.184200] [<a8000000000a2c3c>] rcu_check_gp_kthread_starvation+0xdc/0x104
[  132.267997] [<a8000000000a2110>] rcu_check_callbacks+0x6f0/0x944
[  132.340276] [<a8000000000a6644>] update_process_times+0x34/0x70
[  132.411500] [<a8000000000b636c>] tick_sched_timer+0x3c/0x90
[  132.478541] [<a8000000000a6c6c>] __hrtimer_run_queues+0x128/0x1c4
[  132.551865] [<a8000000000a7654>] hrtimer_interrupt+0xf4/0x274
[  132.620996] [<a80000000001e328>] ip27_hub_rt_compare_irq+0xa8/0xb8
[  132.695367] [<a800000000092d7c>] __handle_irq_event_percpu+0xe0/0x1a4
[  132.772881] [<a800000000092e60>] handle_irq_event_percpu+0x20/0x7c
[  132.847252] [<a800000000097c34>] handle_percpu_irq+0xa8/0xd8
[  132.915336] [<a800000000092198>] generic_handle_irq+0x28/0x38
[  132.984469] [<a8000000000257f0>] do_IRQ+0x18/0x28
[  133.041033] [<a8000000000237d0>] handle_int+0x160/0x16c
[  133.103882] [<a8000000006294a8>] arch_local_irq_restore+0x18/0x20
[  133.177208] [<a80000000062f8ac>] _raw_spin_unlock_irqrestore+0x2c/0x38
[  133.255764] [<a8000000000a0418>] force_qs_rnp+0x12c/0x1e0
[  133.320709] [<a8000000000a0dc0>] rcu_gp_kthread+0x8f4/0xa04
[  133.387749] [<a800000000067b20>] kthread+0x178/0x180
[  133.447455] [<a800000000023408>] ret_from_kernel_thread+0x14/0x1c
[  133.521008] zbud: loaded
[  133.557968] FS-Cache: Netfs 'nfs' registered for caching
[  133.620895] NFS: Registering the id_resolver key type
[  133.680816] Key type id_resolver registered
[  133.731068] Key type id_legacy registered
[  133.779278] efs: 1.0a - http://aeschi.ch.eu.org/efs/
[  133.839023] SGI XFS with ACLs, security attributes, realtime, no debug enabled
[  133.948288] NET: Registered protocol family 38
[  134.000396] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 253)
[  134.089007] io scheduler noop registered
[  134.135990] io scheduler deadline registered (default)
[  134.197788] io scheduler mq-deadline registered
[  134.252271] io scheduler kyber registered
[  134.300577] io scheduler bfq registered
[  134.414549] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
[  134.513376] loop: module loaded
[  134.550054] zram: Added device: zram0
[  134.593440] st: Version 20160209, fixed bufsize 32768, s/g segs 256
[  134.670019] PCI host bridge to bus 0000:00
[  134.717891] pci_bus 0000:00: root bus resource [mem 0x9200000008200000-0x92000000089fffff]
[  134.817384] pci_bus 0000:00: root bus resource [io  0x9200000008a00000-0x9200000008bfffff]
[  134.916882] pci_bus 0000:00: root bus resource [bus 00-ff]
[  134.984972] qla1280: QLA1040 found on PCI bus 0, dev 0
[  135.044892] PCI: Enabling device 0000:00:00.0 (0006 -> 0007)
[  135.113658] scsi(0): Enabling vchannel on BRIDGE for SGI/MIPS
[  135.631732] random: crng init done
[  135.910429] scsi(0:0): Resetting SCSI BUS
[  139.047615] scsi host0: QLogic QLA1040 PCI to SCSI Host Adapter
[  139.047615]        Firmware version:  7.65.06, Driver version 3.27.1
[  139.202724] qla1280: QLA1040 found on PCI bus 0, dev 1
[  139.262666] PCI: Enabling device 0000:00:01.0 (0006 -> 0007)
[  139.439149] scsi 0:0:1:0: Direct-Access     SEAGATE  SX150176LC       BA12 PQ: 0 ANSI: 2
[  139.536089] scsi(0:0:1:0):
[  139.536102]  Sync: period 10, offset 12
[  139.567101] , Wide
[  139.613192] , Tagged queuing: depth 31
[  139.637283]
[  139.712064] scsi 0:0:2:0: Direct-Access     SEAGATE  SX150176LC       BA08 PQ: 0 ANSI: 2
[  139.808993] scsi(0:0:2:0):
[  139.809005]  Sync: period 10, offset 12
[  139.840073] , Wide
[  139.886163] , Tagged queuing: depth 31
[  139.910253]
[  139.985472] scsi 0:0:3:0: Direct-Access     SEAGATE  SX150176LC       BA08 PQ: 0 ANSI: 2
[  140.063309] scsi(1:0): Resetting SCSI BUS
[  140.130319] scsi(0:0:3:0):
[  140.130331]  Sync: period 10, offset 12
[  140.161645] , Wide
[  140.207736] , Tagged queuing: depth 31
[  140.231827]
[  140.311662] scsi 0:0:4:0: Direct-Access     SEAGATE  SX150176LC       BA11 PQ: 0 ANSI: 2
[  140.408418] scsi(0:0:4:0):
[  140.408428]  Sync: period 10, offset 12
[  140.439646] , Wide
[  140.485736] , Tagged queuing: depth 31
[  140.509827]
[  140.591720] scsi 0:0:5:0: Direct-Access     SEAGATE  SX150176LC       BA11 PQ: 0 ANSI: 2
[  140.688494] scsi(0:0:5:0):
[  140.688504]  Sync: period 10, offset 12
[  140.719740] , Wide
[  140.765831] , Tagged queuing: depth 31
[  140.789921]
[  140.874909] scsi 0:0:6:0: Direct-Access     SGI      IBM  DGHS18Y     0190 PQ: 0 ANSI: 3
[  140.971650] scsi(0:0:6:0):
[  140.971660]  Sync: period 10, offset 12
[  141.002872] , Wide
[  141.048964] , Tagged queuing: depth 31
[  141.073055]
[  145.367397] scsi host1: QLogic QLA1040 PCI to SCSI Host Adapter
[  145.367397]        Firmware version:  7.65.06, Driver version 3.27.1
[  145.810686] ioc3: part: [], serial: [] => class IP27 BaseIO
[  145.880026] ioc3-eth: Ethernet address is 08:00:69:0d:c9:a2.
[  145.947668] IOC3 0000:00:02.0 eth0: link up, 100Mbps, full-duplex, lpa 0xCDE1
[  146.032163] eth0: Using PHY 31, vendor 0x20005c0, model 0, rev 1.
[  146.105495] eth0: IOC3 SSRAM has 128 kbyte.
[  146.157409] rtc-m48t35 rtc-m48t35: rtc core: registered m48t35 as rtc0
[  146.309864] 0000:00:02.0: ttyS0 at IOC3 0x8620178 (irq = 0, base_baud = 458333) is a 16550A
[  146.408694] console [ttyS0] enabled
[  146.491467] bootconsole [early0] disabled
[  146.491467] bootconsole [early0] disabled
[  146.610666] 0000:00:02.0: ttyS1 at IOC3 0x8620170 (irq = 0, base_baud = 458333) is a 16550A
[  146.711497] scsi 1:0:2:0: Sequential-Access EXABYTE  EXB-85058HE-0000 0112 PQ: 0 ANSI: 2
[  146.720941] IOC3 Master Driver loaded for 0000:00:02.0
[  146.721871] NET: Registered protocol family 10
[  146.723762] Segment Routing with IPv6
[  146.723858] sit: IPv6, IPv4 and MPLS over IPv4 tunneling driver
[  146.724891] NET: Registered protocol family 17
[  146.725028] 8021q: 802.1Q VLAN Support v1.8
[  146.725108] Key type dns_resolver registered
[  146.809143] rtc-m48t35 rtc-m48t35: setting system clock to 2017-10-09 21:36:26 UTC (1507584986)
[  47.302708] scsi(1:0:2:0):
[  147.302724]  Sync: period 10, offset 12
[  147.334248]
[  165.234496] scsi 1:0:6:0: CD-ROM            TOSHIBA  CD-ROM XM-6401TA 1003 PQ: 0 ANSI: 2
[  165.332816] scsi(1:0:6:0):
[  165.332828]  Sync: period 10, offset 12
[  165.365461]
[  172.248920] sd 0:0:1:0: Attached scsi generic sg0 type 0
[  172.314293] sd 0:0:2:0: Attached scsi generic sg1 type 0
[  172.316704] sd 0:0:2:0: [sdb] 97693755 512-byte logical blocks: (50.0 GB/46.6 GiB)
[  172.319011] sd 0:0:2:0: [sdb] Write Protect is off
[  172.321899] sd 0:0:2:0: [sdb] Write cache: disabled, read cache: enabled, supports DPO and FUA
[  172.388881] sd 0:0:1:0: [sda] 97693755 512-byte logical blocks: (50.0 GB/46.6 GiB)
[  172.391640] sd 0:0:1:0: [sda] Write Protect is off
[  172.394578] sd 0:0:1:0: [sda] Write cache: disabled, read cache: enabled, supports DPO and FUA
[  172.415232]  sdb: sdb1 sdb2 sdb3 sdb4 sdb5 sdb6 sdb9 sdb11
[  172.418573]  sda: sda1 sda2 sda3 sda4 sda5 sda6 sda9 sda11
[  172.428482] sd 0:0:2:0: [sdb] Attached SCSI disk
[  172.428904] sd 0:0:1:0: [sda] Attached SCSI disk
[  173.131856] sd 0:0:3:0: Attached scsi generic sg2 type 0
[  173.133900] sd 0:0:3:0: [sdc] 97693755 512-byte logical blocks: (50.0 GB/46.6 GiB)
[  173.136198] sd 0:0:3:0: [sdc] Write Protect is off
[  173.139065] sd 0:0:3:0: [sdc] Write cache: disabled, read cache: enabled, supports DPO and FUA
[  173.165848]  sdc: sdc1 sdc2 sdc3 sdc4 sdc5 sdc6 sdc9 sdc11
[  173.176470] sd 0:0:3:0: [sdc] Attached SCSI disk
[  173.572309] sd 0:0:4:0: Attached scsi generic sg3 type 0
[  173.637640] sd 0:0:5:0: Attached scsi generic sg4 type 0
[  173.638418] sd 0:0:4:0: [sdd] 97693755 512-byte logical blocks: (50.0 GB/46.6 GiB)
[  173.639849] sd 0:0:5:0: [sde] 97693755 512-byte logical blocks: (50.0 GB/46.6 GiB)
[  173.641284] sd 0:0:4:0: [sdd] Write Protect is off
[  173.642627] sd 0:0:5:0: [sde] Write Protect is off
[  173.644230] sd 0:0:4:0: [sdd] Write cache: disabled, read cache: enabled, supports DPO and FUA
[  173.645497] sd 0:0:5:0: [sde] Write cache: disabled, read cache: enabled, supports DPO and FUA
[  173.745090]  sdd: sdd1 sdd2 sdd3 sdd4 sdd5 sdd6 sdd9 sdd11
[  173.752544] sd 0:0:5:0: [sde] Attached SCSI disk
[  173.755793] sd 0:0:4:0: [sdd] Attached SCSI disk
[  174.389121] sd 0:0:6:0: Attached scsi generic sg5 type 0
[  174.390132] sd 0:0:6:0: [sdf] 35563040 512-byte logical blocks: (18.2 GB/17.0 GiB)
[  174.391646] sd 0:0:6:0: [sdf] Write Protect is off
[  174.394708] sd 0:0:6:0: [sdf] Write cache: disabled, read cache: enabled, supports DPO and FUA
[  174.423225]  sdf: sdf1 sdf2 sdf9 sdf11
[  174.431433] sd 0:0:6:0: [sdf] Attached SCSI disk
[ 1942.572077] st 1:0:2:0: Attached scsi tape st0
[ 1942.625796] st 1:0:2:0: st0: try direct i/o: yes (alignment 4 B)
[ 1942.699105] st 1:0:2:0: Attached scsi generic sg6 type 1
[ 1942.766034] sr 1:0:6:0: [sr0] scsi-1 drive
[ 1942.815484] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 1942.879869] sr 1:0:6:0: Attached scsi generic sg7 type 5
[ 1942.967801] Freeing unused kernel memory: 3840K
[ 1943.022480] This architecture does not have kernel memory protection.



timer_entropyd v0.3, (C) 2009-2012 by folkert@vanheusden.com


>>> Generating ECDSA hostkey ...
Generating key, this may take a while...
Public key portion is:
ecdsa-sha2-nistp521 AAAAE2VjZHNhLXNoYTItbmlzdHA1MjEAAAAIbmlzdHA1MjEAAACFBAGcXOvuy/XiYGUxAvqgSqcip8YAnwanZhr1SuY8x4kFe7NF+Y3opbO9mTrGtn+ath3K2LzobRZ3+qgqdxihxqqXsgD+GTUp7tkJuKZVRPGUJhtNT8Vv8KTiz8GlXRYyHyeL91ub8kQLv9WCy9fWlbLQtmoN4FQArzk+9nTIqWxTcatGGA== root@mips-netboot-2017.07
Fingerprint: md5 ff:c9:d6:1e:c7:e8:58:70:3d:df:8e:97:04:b0:3c:52

>>> Generating RSA hostkey ...
Generating key, this may take a while...
Public key portion is:
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCpjuxz7nboAFak8qay/F91iOGD5dycFvisbSDDvRAfparE0wTE+2vUp6P8f9qzYTfLgLxoLMX4xPM1VMmeFdTdxcU1twGnKx6s+42EE+O5cSFrXn48TmcsO4onTNf7JqFnka5Z6sd0LRhMIISDVgsTb5oHOA9yWs5IXHvy0p4197ohT0Y/P8fDnt3qxLN8VV/bytDbeqGzGEYnVncpT4799idfvsE/oh2X8iZgU64IX8/I4YBvIpZTDQwPrsYizKhj3UowYNo0hJMriEGPvfXfDN4Vhvti7jcE32kTRs5ypyQWVu7fK9DwPaCzWN5Zfk2KNXzpaAR0qyljIwI3y+0Z root@mips-netboot-2017.07
Fingerprint: md5 e8:cf:05:2f:53:44:c3:f9:07:ec:65:70:32:9d:42:52



Gentoo Linux; http://www.gentoo.org/
 Copyright 1999-2017 Gentoo Foundation; Distributed under the GPLv2

 Gentoo/MIPS Netboot for SGI Origin/Onyx2 Systems
 Build Date: July 29, 2017

 * To configure networking (eth0), do the following:

 * For Static IPv4:
 * /bin/net-setup.sh <IP Address> <Gateway Address>

 * For Dynamic IPv4:
 * /bin/net-setup.sh dhcp


/ #

^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: IP27: debugging RCU stalls under newer code
  2017-10-10  0:39 IP27: debugging RCU stalls under newer code Joshua Kinard
@ 2017-10-16  5:32 ` Joshua Kinard
  0 siblings, 0 replies; 2+ messages in thread
From: Joshua Kinard @ 2017-10-16  5:32 UTC (permalink / raw)
  To: linux-mips

On 10/09/2017 20:39, Joshua Kinard wrote:
> Hi,
> 
> I am still trying to chase down bugs in the modified IP27 patchset I've been
> maintaining.  Getting a rather consistent RCU hang the last few kernel
> versions, but I have not had much luck figuring out why it's hanging.  I
> suspect the re-written interrupt code I did has a bug in it somewhere.
> 
> Here's the most common symptom on the Origin 200 machine that I have:
> 
> [    5.093177] NET: Registered protocol family 1
> [    5.145708] RPC: Registered named UNIX socket transport module.
> [    5.216044] RPC: Registered udp transport module.
> [    5.272604] RPC: Registered tcp transport module.
> [    5.329150] RPC: Registered tcp NFSv4.1 backchannel transport module.
> [  128.087259] INFO: rcu_sched detected stalls on CPUs/tasks:
> [  128.151416]  (detected by 1, t=11887 jiffies, g=-261, c=-262, q=3)
> [  128.225797] All QSes seen, last rcu_sched kthread activity 11885 (4294949888-4294938003), jiffies_till_next_fqs=1, root ->qsmask 0x0
> [  128.369292] rcu_sched       R  running task        0     8      2 0x00100000
> [  128.411349] workingset: timestamp_bits=40 max_order=15 bucket_order=0
> [  128.531651] Stack : a800000000c2fe28 000000000000000a 000000000000000a a800000000c30232
> [  128.628013]         000000000000001d a800000000c30000 0000000000000000 0000000000000000
> [  128.724381]         0000000000000005 0000000000000000 000000000000000a a80000000008fe3c
> [  128.820749]         ffffffff94005ce0 0000000000000000 0000000000000000 a800000070195480
> [  128.917116]         a8000000007c0000 a8000000007c04f0 a8000000007c0000 0000000000000000
> [  129.013484]         a8000000007c0000 fffffffffffffefb 0000000000000000 0000000000000000
> [  129.109852]         0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [  129.206220]         0000000000000000 a8000000700cfc20 0000000000000000 a8000000000a227c
> [  129.302587]         0000000000000000 0000000000000000 a800000070195480 a800000000072c00
> [  129.398954]         00000000ffff8d93 a800000000029350 a8000000007cf700 a8000000000a227c
> [  129.495322]         ...
> [  129.524655] Call Trace:
> [  129.554030] [<a800000000029350>] show_stack+0x70/0x9c
> [  129.614757] [<a8000000000a227c>] rcu_check_callbacks+0x85c/0x944
> [  129.687036] [<a8000000000a6644>] update_process_times+0x34/0x70
> [  129.758252] [<a8000000000b636c>] tick_sched_timer+0x3c/0x90
> [  129.825291] [<a8000000000a6c6c>] __hrtimer_run_queues+0x128/0x1c4
> [  129.898615] [<a8000000000a7654>] hrtimer_interrupt+0xf4/0x274
> [  129.967747] [<a80000000001e328>] ip27_hub_rt_compare_irq+0xa8/0xb8
> [  130.042120] [<a800000000092d7c>] __handle_irq_event_percpu+0xe0/0x1a4
> [  130.119630] [<a800000000092e60>] handle_irq_event_percpu+0x20/0x7c
> [  130.194007] [<a800000000097c34>] handle_percpu_irq+0xa8/0xd8
> [  130.262085] [<a800000000092198>] generic_handle_irq+0x28/0x38
> [  130.331218] [<a8000000000257f0>] do_IRQ+0x18/0x28
> [  130.387781] [<a8000000000237d0>] handle_int+0x160/0x16c
> [  130.450644] [<a8000000006294a8>] arch_local_irq_restore+0x18/0x20
> [  130.523964] [<a80000000062f8ac>] _raw_spin_unlock_irqrestore+0x2c/0x38
> [  130.602514] [<a8000000000a0418>] force_qs_rnp+0x12c/0x1e0
> [  130.667458] [<a8000000000a0dc0>] rcu_gp_kthread+0x8f4/0xa04
> [  130.734511] [<a800000000067b20>] kthread+0x178/0x180
> [  130.794204] [<a800000000023408>] ret_from_kernel_thread+0x14/0x1c
> [  130.867538] rcu_sched kthread starved for 11885 jiffies! g18446744073709551355 c18446744073709551354 f0x2 RCU_GP_DOING_FQS(4) ->state=0x0
> [  131.016256] rcu_sched       R  running task        0     8      2 0x00100000
> [  131.101100] Stack : a800000000c2fe28 000000000000000a 000000000000000a a800000000c30232
> [  131.197468]         000000000000001d a800000000c30000 0000000000000000 0000000000000000
> [  131.293835]         0000000000000005 0000000000000000 000000000000000a a80000000008fe3c
> [  131.390203]         ffffffff94005ce0 0000000000000000 0000000000000000 a800000070195480
> [  131.486570]         a8000000007c0000 a8000000007c04f0 a8000000007c0000 0000000000000000
> [  131.582937]         a8000000007c0000 fffffffffffffefb 0000000000000000 0000000000000000
> [  131.679305]         0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [  131.775673]         0000000000000000 a8000000700cfc00 0000000000000000 a8000000000a2c3c
> [  131.872040]         0000000000000000 0000000000000000 a800000070195480 a800000000072c00
> [  131.968408]         fffffffffffffefa a800000000029350 a8000000007cf700 a8000000000a2c3c
> [  132.064776]         ...
> [  132.094106] Call Trace:
> [  132.123446] [<a800000000029350>] show_stack+0x70/0x9c
> [  132.184200] [<a8000000000a2c3c>] rcu_check_gp_kthread_starvation+0xdc/0x104
> [  132.267997] [<a8000000000a2110>] rcu_check_callbacks+0x6f0/0x944
> [  132.340276] [<a8000000000a6644>] update_process_times+0x34/0x70
> [  132.411500] [<a8000000000b636c>] tick_sched_timer+0x3c/0x90
> [  132.478541] [<a8000000000a6c6c>] __hrtimer_run_queues+0x128/0x1c4
> [  132.551865] [<a8000000000a7654>] hrtimer_interrupt+0xf4/0x274
> [  132.620996] [<a80000000001e328>] ip27_hub_rt_compare_irq+0xa8/0xb8
> [  132.695367] [<a800000000092d7c>] __handle_irq_event_percpu+0xe0/0x1a4
> [  132.772881] [<a800000000092e60>] handle_irq_event_percpu+0x20/0x7c
> [  132.847252] [<a800000000097c34>] handle_percpu_irq+0xa8/0xd8
> [  132.915336] [<a800000000092198>] generic_handle_irq+0x28/0x38
> [  132.984469] [<a8000000000257f0>] do_IRQ+0x18/0x28
> [  133.041033] [<a8000000000237d0>] handle_int+0x160/0x16c
> [  133.103882] [<a8000000006294a8>] arch_local_irq_restore+0x18/0x20
> [  133.177208] [<a80000000062f8ac>] _raw_spin_unlock_irqrestore+0x2c/0x38
> [  133.255764] [<a8000000000a0418>] force_qs_rnp+0x12c/0x1e0
> [  133.320709] [<a8000000000a0dc0>] rcu_gp_kthread+0x8f4/0xa04
> [  133.387749] [<a800000000067b20>] kthread+0x178/0x180
> [  133.447455] [<a800000000023408>] ret_from_kernel_thread+0x14/0x1c
> [  133.521008] zbud: loaded

After a fair bit of experimentation over the weekend, it does indeed look like
the culprit for these RCU stalls is the early console serial driver spamming
the IOC3 UART registers at a slow baud rate.  IP27 prints early PROM messages
at 9600bps at poweron, thus not changing the baud rate very often.  Booting
with 38400bps, plus enabling CONFIG_RCU_FAST_NO_HZ and CONFIG_RCU_NOCB_CPU
together appear to significantly reduce the issue.  My Origin 200 can now boot
to a netboot prompt in ~24-38 seconds, where it normally takes upwards of 3-5
minutes, if everything happens to work right.  Next, have to test the Onyx2 out...

Likely not the correct fix, but it works for now.  Octane could also trip up an
RCU stall under the right conditions, but I made some changes to the IRQ
handling that appears to have eliminated those.  IP27 is just a much more
fickle animal with its main console being over a serial line.  Probably buried
deep in the RCU code and docs is a function call that can be added to places w/
tight looping to allow RCU updates to happen periodically, but that's an
experiment for another day.

-- 
Joshua Kinard
Gentoo/MIPS
kumba@gentoo.org
6144R/F5C6C943 2015-04-27
177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943

"The past tempts us, the present confuses us, the future frightens us.  And our
lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic

^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2017-10-16  5:35 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-10  0:39 IP27: debugging RCU stalls under newer code Joshua Kinard
2017-10-16  5:32 ` Joshua Kinard

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.