From: "Leyendecker, Robert" <Robert.Leyendecker@lsi.com>
To: rt-users <linux-rt-users@vger.kernel.org>
Subject: help needed, 2.6.31.6-rt19 hang with network user app
Date: Fri, 20 Nov 2009 15:56:12 -0500
Message-ID: <8C8865ED624BB94F8FE50259E2B5C5B304593DAA9E@palmail03.lsi.com>
Greetings:
I am hoping for some help troubleshooting a lockup related to networking. Apologies in advance for the long problem report. I am new to linux-rt, so this may not be the right place to post this; feel free to point me to a better one.
Release:
FC10 release, 2.6.31.6-rt19 #1 SMP PREEMPT RT Wed Nov 18 22:20:20 CST 2009 i686 i686 i386 GNU/Linux
CPU core duo, T8400
Problem:
The problem does not occur when running FC10 release without RT patch.
I have three "threads" on the RT host A (affinity set to CPU0 - it seemed to work the best at reducing jitter):
Thread 1 (started using pthread, priority 49)
  Start:
    Set POSIX timer to expire in 5 msec
    Output 128 packets (120 bytes each) to a single raw socket to Host B
    Go to Start
Thread 2 (run from main, priority 37)
  Start:
    Epoll for a single event on raw socket
    Read one packet from Host B
    Go to Start
Thread 3 (started from pthread, priority 25)
  Print the rx and tx packet count
On another, non-RT FC10 host (Host B) on the same network (connected through a Cisco GigE switch) the same thread is running, so the two hosts exchange packets. Usually the test runs for 30 minutes to 4 hours, and then the RT system hangs. Converting Host A to non-RT allows me to run for 24 hours or more (no failures recorded). The application is meant to control packet jitter, and RT does this well when it doesn't hang. I have also recorded instances of the RT system hanging when my app is not running but Host B is pounding the Host A interface with packets. This is more difficult to reproduce, and I believe I have encountered it only twice out of hundreds of tests. I memlock about 100 Mbyte; only a fraction of it is used for the reduced test case.
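For scale, the Thread 1 transmit pattern described above can be turned into a packet rate. This is only a back-of-the-envelope sketch: it assumes the 120 bytes are the full per-packet size and ignores Ethernet framing overhead, and the function names are made up for illustration.

```python
# Load generated by Thread 1: 128 packets of 120 bytes every 5 ms.
# Integer arithmetic in microseconds avoids floating-point rounding.

def packets_per_second(packets_per_burst: int, period_us: int) -> int:
    """Packets sent per second for one burst every period_us microseconds."""
    return packets_per_burst * 1_000_000 // period_us


def bits_per_second(packets_per_burst: int, packet_bytes: int, period_us: int) -> int:
    """Bit rate of the packet bytes only (no framing overhead counted)."""
    return packets_per_second(packets_per_burst, period_us) * packet_bytes * 8


pps = packets_per_second(128, 5000)
bps = bits_per_second(128, 120, 5000)
print(pps, "packets/s,", bps / 1e6, "Mbit/s in one direction")
```

That is 25,600 packets per second and about 24.6 Mbit/s from Host A. If Host B sends at the same rate, Host A receives a similar packet rate on top of that.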
Hang details:
1. The UI freezes: no keyboard or mouse response. The display stays up, but the screen is frozen.
2. Host B reports no data from Host A. When Host B is terminated and unplugged from the network, the network card on Host A still blinks as if it is sending or receiving data. Unplugging Host A stops the blinking. Plugging A back in starts the blinking again. I have waited 20 minutes or more and the card is still blinking.
Interrupts:
Note: I can make things better or worse by changing these settings, but I am unable to resolve the problem completely.
This is just the last set-up I tried. I realize these may be incorrect and would appreciate some guidance.
The load is heavy on the network side, so I have the network-related threads at high priority. Mostly I am relying on ad-hoc experiments and word of mouth for the best settings.
It seems to be a black art. The same is true for the stopped services.
I have tried both FIFO (FF) and round-robin (RR) scheduling.
irq rtc0 set to priority 90
irq eth0 not found.
irq eth1 set to priority 89
irq net-tx/0 set to priority 88
irq net-rx/0 set to priority 87
irq net-tx/1 set to priority 86
irq net-rx/1 set to priority 85
irq tasklet/0 set to priority 84
irq tasklet/1 set to priority 83
irq hrtimer/0 set to priority 82
irq hrtimer/1 set to priority 81
irq i8042 set to priority 20
irq bluetooth set to priority 19
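For reference, a listing like the one above can be produced along the following lines. This is a minimal sketch, not the script actually used: it assumes the kernel threads are matched by a fragment of their name as read from /proc/<pid>/stat, and all function names are hypothetical. Setting the priority needs root.

```python
import os
import re


def find_threads(name_fragment: str, proc_root: str = "/proc") -> dict:
    """Return {pid: thread_name} for tasks whose name contains name_fragment.

    The name is the parenthesised field of <proc_root>/<pid>/stat.
    """
    found = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                stat = f.read()
        except OSError:
            continue  # task exited while we were scanning
        match = re.search(r"\((.*)\)", stat)
        if match and name_fragment in match.group(1):
            found[int(entry)] = match.group(1)
    return found


def set_fifo_priority(pid: int, priority: int) -> None:
    """Switch one task to SCHED_FIFO at the given priority (needs root)."""
    os.sched_setscheduler(pid, os.SCHED_FIFO, os.sched_param(priority))


def apply_priorities(wanted: dict, proc_root: str = "/proc") -> list:
    """Apply {name_fragment: priority}; return fragments that matched nothing."""
    missing = []
    for fragment, priority in wanted.items():
        threads = find_threads(fragment, proc_root)
        if not threads:
            missing.append(fragment)
        for pid in threads:
            set_fifo_priority(pid, priority)
    return missing
```

Run as root, apply_priorities({"eth1": 89, "net-tx/0": 88, "net-rx/0": 87}) would set those three threads. Any fragment returned in the list corresponds to a "not found" line such as the eth0 one above. The command-line equivalent for a single thread is chrt -f -p <priority> <pid>.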
Here is a sample of interrupts while things are working OK
           CPU0       CPU1
  0:        371          7   IO-APIC-edge      timer
  1:          2          0   IO-APIC-edge      i8042
  4:          1          1   IO-APIC-edge
  7:          0          0   IO-APIC-edge      parport0
  8:         49         16   IO-APIC-edge      rtc0
  9:          0          0   IO-APIC-fasteoi   acpi
 12:          3          1   IO-APIC-edge      i8042
 16:        299      18241   IO-APIC-fasteoi   uhci_hcd:usb3, HDA Intel
 17:          0          1   IO-APIC-fasteoi   uhci_hcd:usb4, uhci_hcd:usb7
 18:          0          0   IO-APIC-fasteoi   uhci_hcd:usb8
 22:          1          2   IO-APIC-fasteoi   ehci_hcd:usb1, uhci_hcd:usb5
 23:          0          0   IO-APIC-fasteoi   ehci_hcd:usb2, uhci_hcd:usb6
 24:    1228084          0   HPET_MSI-edge     hpet2
 25:          0    1744782   HPET_MSI-edge     hpet3
 31:       2288      53354   PCI-MSI-edge      ahci
 33:      56567      56124   PCI-MSI-edge      i915@pci:0000:00:02.0
 34:    8035113    7957293   PCI-MSI-edge      eth1
NMI:          0          0   Non-maskable interrupts
LOC:       1704       1593   Local timer interrupts
SPU:          0          0   Spurious interrupts
CNT:          0          0   Performance counter interrupts
PND:          0          0   Performance pending work
RES:   10942640   10060362   Rescheduling interrupts
CAL:       5795       2565   Function call interrupts
TLB:        108        158   TLB shootdowns
TRM:          0          0   Thermal event interrupts
THR:          0          0   Threshold APIC interrupts
MCE:          0          0   Machine check exceptions
MCP:         22         22   Machine check polls
ERR:          0
MIS:          0
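To compare how interrupts are spread across the two CPUs between runs, the counts can be pulled out of /proc/interrupts with a few lines of script. This is a sketch only: it assumes the usual layout with a CPU0/CPU1 header row, the helper names are invented, and the embedded sample copies a few rows from the listing above.

```python
def parse_interrupts(text: str) -> dict:
    """Parse /proc/interrupts text into {label: (per_cpu_counts, description)}."""
    lines = text.strip("\n").splitlines()
    n_cpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        counts = []
        # Leading numeric fields are per-CPU counts; ERR/MIS carry just one.
        while fields and len(counts) < n_cpus and fields[0].isdigit():
            counts.append(int(fields.pop(0)))
        table[label.strip()] = (counts, " ".join(fields))
    return table


def cpu_share(counts: list) -> list:
    """Fraction of an interrupt's total that each CPU handled."""
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


SAMPLE = """\
           CPU0       CPU1
 24:    1228084          0   HPET_MSI-edge      hpet2
 34:    8035113    7957293   PCI-MSI-edge      eth1
ERR:          0
"""

table = parse_interrupts(SAMPLE)
counts, description = table["34"]
print(description, [round(share, 3) for share in cpu_share(counts)])
```

On the live system the input would be open("/proc/interrupts").read(). In the sample above, the eth1 interrupts are split almost evenly between CPU0 and CPU1, while hpet2 and hpet3 each fire on one CPU only. If the intent is to keep eth1 interrupt handling on a single CPU, /proc/irq/34/smp_affinity is the usual knob; whether that changes the hang is untested.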
Performance:
Here is top while things are working OK (user app is called smash)
top - 11:25:27 up 1:48, 4 users, load average: 0.00, 0.00, 0.00
Tasks: 173 total, 1 running, 171 sleeping, 0 stopped, 1 zombie
Cpu(s): 7.0%us, 9.5%sy, 0.0%ni, 76.3%id, 0.0%wa, 0.0%hi, 7.3%si, 0.0%st
Mem: 2004612k total, 590240k used, 1414372k free, 51744k buffers
Swap: 4030456k total, 0k used, 4030456k free, 257776k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
11190 root      -2  19  122m 122m 3088 S 32.9  6.3   6:48.44 smash
    7 root     -88  -5     0    0    0 S  8.0  0.0   1:43.43 sirq-net-rx/0
   21 root     -86  -5     0    0    0 S  7.6  0.0   4:11.65 sirq-net-rx/1
 9032 root     -90  -5     0    0    0 S  1.7  0.0   0:16.97 irq/34-eth1
    1 root      20   0  2008  772  564 S  0.0  0.0   0:02.37 init
    2 root      15  -5     0    0    0 S  0.0  0.0   0:00.00 kthreadd
    3 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 migration/0
    4 root     -50  -5     0    0    0 S  0.0  0.0   0:00.00 sirq-high/0
    5 root     -50  -5     0    0    0 S  0.0  0.0   0:00.00 sirq-timer/0
    6 root     -89  -5     0    0    0 S  0.0  0.0   0:00.00 sirq-net-tx/0
    8 root     -50  -5     0    0    0 S  0.0  0.0   0:00.00 sirq-block/0
    9 root     -85  -5     0    0    0 S  0.0  0.0   0:00.00 sirq-tasklet/0
   10 root     -50  -5     0    0    0 S  0.0  0.0   0:00.00 sirq-sched/0
   11 root     -83  -5     0    0    0 S  0.0  0.0   0:00.00 sirq-hrtimer/0
   12 root     -50  -5     0    0    0 S  0.0  0.0   0:00.00 sirq-rcu/0
   13 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 posixcputmr/0
   14 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/0
   15 root      10 -10     0    0    0 S  0.0  0.0   0:00.00 desched/0
   16 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 migration/1
   17 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 posixcputmr/1
   18 root     -50  -5     0    0    0 S  0.0  0.0   0:00.00 sirq-high/1
   19 root     -50  -5     0    0    0 S  0.0  0.0   0:01.08 sirq-timer/1
   20 root     -87  -5     0    0    0 S  0.0  0.0   0:00.00 sirq-net-tx/1
   22 root     -50  -5     0    0    0 S  0.0  0.0   0:00.14 sirq-block/1
   23 root     -84  -5     0    0    0 S  0.0  0.0   0:00.02 sirq-tasklet/1
   24 root     -50  -5     0    0    0 S  0.0  0.0   0:00.00 sirq-sched/1
   25 root     -82  -5     0    0    0 S  0.0  0.0   0:00.00 sirq-hrtimer/1
   26 root     -50  -5     0    0    0 S  0.0  0.0   0:00.00 sirq-rcu/1
   27 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/1
   28 root      10 -10     0    0    0 S  0.0  0.0   0:00.01 desched/1
   29 root      15  -5     0    0    0 S  0.0  0.0   0:00.00 rcu_sched_grace
   30 root      -2 -20     0    0    0 S  0.0  0.0   0:00.00 events/0
   31 root      -2 -20     0    0    0 S  0.0  0.0   0:00.19 events/1
   32 root      15  -5     0    0    0 S  0.0  0.0   0:00.00 cpuset
   33 root      15  -5     0    0    0 S  0.0  0.0   0:00.00 khelper
   38 root      15  -5     0    0    0 S  0.0  0.0   0:00.00 async/mgr
  161 root      15  -5     0    0    0 S  0.0  0.0   0:00.00 kintegrityd/0
  162 root      15  -5     0    0    0 S  0.0  0.0   0:00.00 kintegrityd/1
  164 root      15  -5     0    0    0 S  0.0  0.0   0:00.00 kblockd/0
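The PR column in the top output can be checked against the priorities set earlier. For real-time tasks, top shows -1 minus the real-time priority, and "RT" for priority 99. A small sketch of that mapping, with illustrative function names:

```python
def top_pr(rt_priority: int) -> str:
    """PR column that top shows for a SCHED_FIFO/SCHED_RR task."""
    if not 1 <= rt_priority <= 99:
        raise ValueError("real-time priorities run from 1 to 99")
    value = -1 - rt_priority
    return "RT" if value <= -100 else str(value)


def rt_priority_from_pr(pr: str) -> int:
    """Inverse mapping: recover the real-time priority from a PR value."""
    if pr == "RT":
        return 99
    priority = -1 - int(pr)
    if not 1 <= priority <= 99:
        raise ValueError(f"{pr!r} is not a real-time PR value")
    return priority
```

With this mapping, irq/34-eth1 at -90 is priority 89, sirq-net-tx/0 at -89 is priority 88 and sirq-net-rx/0 at -88 is priority 87, which agrees with the settings listed under Interrupts. By the same arithmetic, the sirq threads shown at -50 (including sirq-timer) are at priority 49, the same value as Thread 1.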
Services:
Here is a list of services and their status.
(Some services respond to a status query from my script with no text output and a "0" return code. I report these as "unable to determine state", but they are really dead.)
initd acpid is stopped.
initd anacron is stopped.
initd atd is stopped.
initd auditd is started.
initd avahi-daemon is stopped.
initd bluetooth is stopped.
initd btseed is stopped.
initd bttrack is stopped.
initd cpuspeed is stopped.
initd crond is started.
initd cups is stopped.
initd cups-config-daemon is stopped.
initd dnsmasq is stopped.
initd firstboot is stopped.
initd fuse is started.
initd gpm is stopped.
initd haldaemon is started.
initd halt is stopped.
initd httpd is stopped.
initd ip6tables is stopped.
initd iptables is started.
initd irda is stopped.
initd irqbalance is stopped.
initd jetty is stopped.
initd kerneloops is stopped.
initd killall is stopped.
initd lm_sensors is stopped.
initd mdmonitor is stopped.
initd messagebus is started.
initd microcode_ctl unable to determine state.
initd multipathd is stopped.
initd netconsole is stopped.
initd netfs is stopped.
initd netplugd is stopped.
initd network is started.
initd NetworkManager is stopped.
initd nfs is stopped.
initd nfslock is stopped.
initd nmb is stopped.
initd nscd is stopped.
initd ntpd is stopped.
initd ntpdate is stopped.
initd pcscd is stopped.
initd portreserve is stopped.
initd psacct is stopped.
initd rdisc is stopped.
initd restorecond unable to determine state.
initd rpcbind is stopped.
initd rpcgssd is stopped.
initd rpcidmapd is started.
initd rpcsvcgssd is stopped.
initd rsyslog is stopped.
initd saslauthd is stopped.
initd sendmail is stopped.
initd setroubleshoot is stopped.
initd smartd is stopped.
initd smb is stopped.
initd smolt is stopped.
initd snmpd is stopped.
initd snmptrapd is stopped.
initd sshd is started.
initd udev-post unable to determine state.
initd winbind is stopped.
initd wpa_supplicant is stopped.
initd xinetd is stopped.
initd ypbind is stopped.
Timers:
One thing that confuses me is the timer/clock situation.
Any help on the best settings here is appreciated.
I see the following timers. Is there any guidance on the implications of changing the priority of these?
sirq-timer
sirq-hrtimer
posixcputimer
rtc0
HPET (same as hrtimer?)
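On the HPET question: HPET is a hardware timer, while hrtimers are the kernel's high-resolution timer subsystem, which runs on top of whatever clock hardware the kernel selects. A quick way to see what is in use is sketched below. It assumes the standard sysfs clocksource directory; the function names are illustrative.

```python
import os
import time

CLOCKSOURCE_DIR = "/sys/devices/system/clocksource/clocksource0"


def read_clocksources(base: str = CLOCKSOURCE_DIR) -> tuple:
    """Return (current, available) clocksource names from sysfs."""
    with open(os.path.join(base, "current_clocksource")) as f:
        current = f.read().strip()
    with open(os.path.join(base, "available_clocksource")) as f:
        available = f.read().split()
    return current, available


def monotonic_resolution_ns() -> int:
    """Resolution the kernel reports for CLOCK_MONOTONIC, in nanoseconds."""
    return round(time.clock_getres(time.CLOCK_MONOTONIC) * 1e9)
```

On a kernel with high-resolution timers active, monotonic_resolution_ns() is expected to report 1; a value in the millisecond range would suggest timers are falling back to the periodic tick.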
APIC:
I'm also confused about this. What is the best state for these services? It doesn't look like APIC is running, yet APIC interrupts are occurring. I did include the thermal and cpu modules when the kernel was built; everything else was excluded.
Many thanks to anyone who has made it this far and is still willing to offer suggestions to help me debug.
-Bob