Date: Fri, 20 Jan 2012 11:17:51 -0500
From: Michael Breuer
Subject: Re: Regression: sky2 kernel between 3.1 and 3.2.1 (last known good 3.0.9)
In-reply-to: <20120120081059.1deb4468@s6510.linuxnetplumber.net>
To: Stephen Hemminger
Cc: Jarek Poplawski, David Miller, Stephen Hemminger, linux-kernel@vger.kernel.org, netdev@vger.kernel.org
Message-id: <4F1993AF.1020303@majjas.com>
References: <20100120094103.GA6225@ff.dom.local> <4B58B217.8030001@majjas.com> <20100121204133.GB3085@del.dom.local> <4B59E7EB.3050605@majjas.com> <4F1452B1.4010200@majjas.com> <4F197926.6080309@majjas.com> <20120120081059.1deb4468@s6510.linuxnetplumber.net>
X-Mailing-List: linux-kernel@vger.kernel.org

On 1/20/2012 11:10 AM, Stephen Hemminger wrote:
> On Fri, 20 Jan 2012 09:24:38 -0500
> Michael Breuer wrote:
>
>> On 1/16/2012 11:39 AM, Michael Breuer wrote:
>>> Synopsis:
>>>
>>> Receiving DMAR and other errors after approximately three days of
>>> uptime. The symptoms exactly match errors seen and then fixed around
>>> 2.6.32.4.
>>>
>>> While the system remains unaffected for too long to do a bisect, I was
>>> able to confirm that the problem exists in the 3.1 stable branch (I
>>> jumped from 3.0 to 3.2 when 3.2 was released).
>>>
>>> For now I reverted to the sky2.c from 3.0.9 and am running the rest of
>>> the kernel from 3.1.2, but won't be certain that this works until
>>> later in the week.
>>>
>>> Note that 20 seconds prior to the log extract below were DHCP renewal
>>> attempts on eth1, the issue below was on eth0. Not sure it's relevant,
>>> however back in 2010 a preceding DHCP event did turn out to be
>>> relevant to the manifestation of the bug.
>>>
>>> The 3.2.1-dirty I'm running is from git with a single local patch -
>>> for sidewinder force-feedback support (shouldn't be relevant to the
>>> sky2 issue).
>>>
>>> Log extract:
>>>
>>> Jan 16 05:49:46 mail kernel: [198230.628919] DRHD: handling fault
>>> status reg 2
>>> Jan 16 05:49:46 mail kernel: [198230.628925] sky2 0000:06:00.0: error
>>> interrupt status=0x80000000
>>> Jan 16 05:49:46 mail kernel: [198230.628929] DMAR:[DMA Read] Request
>>> device [06:00.0] fault addr fff78000
>>> Jan 16 05:49:46 mail kernel: [198230.628931] DMAR:[fault reason 06]
>>> PTE Read access is not set
>>> Jan 16 05:49:46 mail kernel: [198230.628939] sky2 0000:06:00.0: PCI
>>> hardware error (0x2010)
>>> Jan 16 05:49:53 mail dhclient[1616]: DHCPREQUEST on eth1 to
>>> 10.240.184.29 port 67
>>> Jan 16 05:50:01 mail kernel: [198246.288400] ------------[ cut here
>>> ]------------
>>> Jan 16 05:50:01 mail kernel: [198246.288408] WARNING: at
>>> net/sched/sch_generic.c:255 dev_watchdog+0x247/0x250()
>>> Jan 16 05:50:01 mail kernel: [198246.288411] Hardware name: System
>>> Product Name
>>> Jan 16 05:50:01 mail kernel: [198246.288413] NETDEV WATCHDOG: eth0
>>> (sky2): transmit queue 0 timed out
>>> Jan 16 05:50:01 mail kernel: [198246.288415] Modules linked in: tcp_lp
>>> cpufreq_stats ebtable_nat ebtables nf_conntrack_netbios_ns
>>> nf_conntrack_broadcast ip6table_mangle ip6table_filter ip6_tables
>>> iptable_mangle ipt_MASQUERADE iptable_nat nf_nat iptable_raw tun
>>> bridge stp llc lockd sit tunnel4 ipt_LOG nf_conntrack_ftp
>>> nf_conntrack_ipv6
>>> nf_defrag_ipv6 xt_CHECKSUM xt_multiport xt_DSCP
>>> w83627ehf xt_mark xt_dscp hwmon_vid binfmt_misc raid1 btrfs sunrpc
>>> zlib_deflate libcrc32c snd_hda_codec_analog snd_ens1371 gameport
>>> snd_hda_intel snd_rawmidi snd_ac97_codec snd_hda_codec snd_hwdep
>>> ac97_bus snd_seq snd_seq_device snd_pcm gspca_spca505 snd_timer
>>> gspca_main snd videodev media soundcore i2c_i801 iTCO_wdt microcode
>>> v4l2_compat_ioctl32 snd_page_alloc i7core_edac sky2 edac_core pcspkr
>>> iTCO_vendor_support virtio_net virtio virtio_ring kvm_intel kvm uinput
>>> ipv6 raid456 async_raid6_recov async_pq raid6_pq async_xor
>>> firewire_ohci firewire_core pata_acpi ata_generic xor async_memcpy
>>> async_tx crc_itu_t pata_marvell nouveau ttm d
>>> Jan 16 05:50:01 mail kernel: rm_kms_helper drm i2c_algo_bit i2c_core
>>> mxm_wmi video [last unloaded: nf_conntrack_broadcast]
>>> Jan 16 05:50:01 mail kernel: [198246.288487] Pid: 0, comm: swapper/0
>>> Tainted: G W 3.2.1-dirty #1
>>> Jan 16 05:50:01 mail kernel: [198246.288489] Call Trace:
>>> Jan 16 05:50:01 mail kernel: [198246.288491]
>>> [] warn_slowpath_common+0x7f/0xc0
>>> Jan 16 05:50:01 mail kernel: [198246.288501] [] ?
>>> lapic_next_event+0x1d/0x30
>>> Jan 16 05:50:01 mail kernel: [198246.288504] []
>>> warn_slowpath_fmt+0x46/0x50
>>> Jan 16 05:50:01 mail kernel: [198246.288509] [] ?
>>> read_tsc+0x9/0x20
>>> Jan 16 05:50:01 mail kernel: [198246.288513] []
>>> dev_watchdog+0x247/0x250
>>> Jan 16 05:50:01 mail kernel: [198246.288518] []
>>> run_timer_softirq+0x12b/0x3b0
>>> Jan 16 05:50:01 mail kernel: [198246.288521] [] ?
>>> qdisc_reset+0x50/0x50
>>> Jan 16 05:50:01 mail kernel: [198246.288525] []
>>> __do_softirq+0xa8/0x210
>>> Jan 16 05:50:01 mail kernel: [198246.288529] []
>>> call_softirq+0x1c/0x30
>>> Jan 16 05:50:01 mail kernel: [198246.288533] []
>>> do_softirq+0x65/0xa0
>>> Jan 16 05:50:01 mail kernel: [198246.288536] []
>>> irq_exit+0x8e/0xb0
>>> Jan 16 05:50:01 mail kernel: [198246.288539] []
>>> do_IRQ+0x63/0xe0
>>> Jan 16 05:50:01 mail kernel: [198246.288543] []
>>> common_interrupt+0x6e/0x6e
>>> Jan 16 05:50:01 mail kernel: [198246.288545]
>>> [] ? intel_idle+0xed/0x150
>>> Jan 16 05:50:01 mail kernel: [198246.288551] [] ?
>>> intel_idle+0xcf/0x150
>>> Jan 16 05:50:01 mail kernel: [198246.288555] []
>>> cpuidle_idle_call+0xc1/0x280
>>> Jan 16 05:50:01 mail kernel: [198246.288559] []
>>> cpu_idle+0xca/0x120
>>> Jan 16 05:50:01 mail kernel: [198246.288563] []
>>> rest_init+0x72/0x74
>>> Jan 16 05:50:01 mail kernel: [198246.288568] []
>>> start_kernel+0x3b5/0x3c0
>>> Jan 16 05:50:01 mail kernel: [198246.288572] []
>>> x86_64_start_reservations+0x132/0x136
>>> Jan 16 05:50:01 mail kernel: [198246.288576] [] ?
>>> early_idt_handlers+0x140/0x140
>>> Jan 16 05:50:01 mail kernel: [198246.288580] []
>>> x86_64_start_kernel+0x102/0x111
>>> Jan 16 05:50:01 mail kernel: [198246.288583] ---[ end trace
>>> bb26011d21a2b1d7 ]---
>>> Jan 16 05:50:01 mail kernel: [198246.288586] sky2 0000:06:00.0: eth0:
>>> tx timeout
>>> Jan 16 05:50:01 mail kernel: [198246.288593] sky2 0000:06:00.0: eth0:
>>> transmit ring 115 .. 10 report=115 done=115
>>>
>>>
>>>
>> FYI - I've been up for four days now without issues running on 3.2.1 +
>> sky2.c from 3.0.9. Looks like the issue is in fact in one of the
>> modifications made in sky2.c between those two releases.
> Since only you seem to be able to reproduce it, most likely the
> bisect burden will be on you. If you know it is only one file,
> then bisecting that file is fairly quick.
>
As of now, I have no reliable way to reproduce...
so this is likely to take about 3-4 days per bisect run... more if it doesn't fail. If there are suggestions as to diagnostic code to put in, or a specific bias towards one version or another, that may reduce the time significantly. I've also got some windows where I have to leave a stable version up.
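
For a rough sense of the total time, here is a small Python sketch. Bisecting n candidate commits on a linear history needs at most ceil(log2(n)) test runs, and each run here costs about 3-4 days. The commit count used below (16) is a made-up placeholder, not the real number of sky2.c changes between the two releases, and the function names are illustrative only.

```python
def bisect_steps(n_commits: int) -> int:
    """Worst-case number of test runs needed to find the first bad commit
    among n_commits candidates on a linear history: ceil(log2(n_commits))."""
    if n_commits < 1:
        raise ValueError("need at least one candidate commit")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, without floats.
    return (n_commits - 1).bit_length()


def bisect_days(n_commits: int, min_days: int = 3, max_days: int = 4) -> tuple[int, int]:
    """Return the (low, high) estimate in days, given the per-run cost range.
    The 3-4 day default comes from the message; everything else is assumed."""
    steps = bisect_steps(n_commits)
    return steps * min_days, steps * max_days


if __name__ == "__main__":
    n = 16  # hypothetical number of commits touching sky2.c; not measured
    low, high = bisect_days(n)
    print(f"{n} commits: {bisect_steps(n)} runs, roughly {low}-{high} days")
```

The estimate grows only logarithmically with the commit count, so limiting the bisect to commits that touch the one file matters less for the number of runs than the per-run wait does. Runs that do not fail within the window would push the total higher.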