All of lore.kernel.org
* Re: kernel BUG at mm/vmscan.c:1114
       [not found] ` <CAJn8CcG-pNbg88+HLB=tRr26_R+A0RxZEWsJQg4iGe4eY2noXA@mail.gmail.com>
@ 2011-08-02  7:22     ` Andrew Morton
  2011-08-02 14:24     ` Mel Gorman
  1 sibling, 0 replies; 27+ messages in thread
From: Andrew Morton @ 2011-08-02  7:22 UTC (permalink / raw)
  To: Xiaotian Feng; +Cc: linux-mm, linux-kernel, mgorman

On Tue, 2 Aug 2011 15:09:57 +0800 Xiaotian Feng <xtfeng@gmail.com> wrote:

>    I'm hitting the kernel BUG at mm/vmscan.c:1114 twice, each time I
> was trying to build my kernel. The photo of crash screen and my config
> is attached.

hm, now why has that started happening?

Perhaps you could apply this debug patch, see if we can narrow it down?

--- a/mm/vmscan.c~a
+++ a/mm/vmscan.c
@@ -54,6 +54,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/vmscan.h>
 
+#define D() do { printk("%s:%d\n", __FILE__, __LINE__); } while (0)
+
 /*
  * reclaim_mode determines how the inactive list is shrunk
  * RECLAIM_MODE_SINGLE: Reclaim only order-0 pages
@@ -1018,27 +1020,37 @@ int __isolate_lru_page(struct page *page
 	int ret = -EINVAL;
 
 	/* Only take pages on the LRU. */
-	if (!PageLRU(page))
+	if (!PageLRU(page)) {
+		D();
 		return ret;
+	}
 
 	/*
 	 * When checking the active state, we need to be sure we are
 	 * dealing with comparible boolean values.  Take the logical not
 	 * of each.
 	 */
-	if (mode != ISOLATE_BOTH && (!PageActive(page) != !mode))
+	if (mode != ISOLATE_BOTH && (!PageActive(page) != !mode)) {
+		printk("mode:%d\n", mode);
+		D();
 		return ret;
+	}
 
-	if (mode != ISOLATE_BOTH && page_is_file_cache(page) != file)
+	if (mode != ISOLATE_BOTH && page_is_file_cache(page) != file) {
+		printk("mode: %d, pifc: %d, file: %d\n", mode,
+					page_is_file_cache(page), file);
+		D();
 		return ret;
-
+	}
 	/*
 	 * When this function is being called for lumpy reclaim, we
 	 * initially look into all LRU pages, active, inactive and
 	 * unevictable; only give shrink_page_list evictable pages.
 	 */
-	if (PageUnevictable(page))
+	if (PageUnevictable(page)) {
+		D();
 		return ret;
+	}
 
 	ret = -EBUSY;
 
_
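
For anyone wanting to see the pattern in isolation: the same
file-and-line breadcrumb trick, as a minimal standalone userspace
sketch (illustrative only, not part of the patch; check_page() is a
made-up stand-in for the real filters in __isolate_lru_page()):

#include <stdio.h>

#define D() do { printf("%s:%d\n", __FILE__, __LINE__); } while (0)

/*
 * Each early return is tagged with its exact file and line, so a
 * later crash can be matched in the log against the last filter
 * that fired.
 */
static int check_page(int on_lru)
{
	if (!on_lru) {
		D();			/* prints e.g. "demo.c:14" */
		return -1;
	}
	return 0;
}

int main(void)
{
	return check_page(0) ? 1 : 0;
}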


* Re: kernel BUG at mm/vmscan.c:1114
       [not found] ` <CAJn8CcG-pNbg88+HLB=tRr26_R+A0RxZEWsJQg4iGe4eY2noXA@mail.gmail.com>
@ 2011-08-02 14:24     ` Mel Gorman
  2011-08-02 14:24     ` Mel Gorman
  1 sibling, 0 replies; 27+ messages in thread
From: Mel Gorman @ 2011-08-02 14:24 UTC (permalink / raw)
  To: Xiaotian Feng; +Cc: linux-mm, linux-kernel, Andrew Morton

On Tue, Aug 02, 2011 at 03:09:57PM +0800, Xiaotian Feng wrote:
> Hi,
>    I'm hitting the kernel BUG at mm/vmscan.c:1114 twice, each time I
> was trying to build my kernel. The photo of crash screen and my config
> is attached. Thanks.
> Regards
> Xiaotian

I am obviously blind because in 3.0, I cannot see what BUG is at
mm/vmscan.c:1114 :(. I see

1109:			/*
1110:			 * If we don't have enough swap space, reclaiming of
1111:			 * anon page which don't already have a swap slot is
1112:			 * pointless.
1113:			 */
1114:			if (nr_swap_pages <= 0 && PageAnon(cursor_page) &&
1115:			    !PageSwapCache(cursor_page))
1116:				break;
1117:
1118:			if (__isolate_lru_page(cursor_page, mode, file) == 0) {
1119:				list_move(&cursor_page->lru, dst);
1120:				mem_cgroup_del_lru(cursor_page);

Is this 3.0 vanilla or are there some other patches applied?

-- 
Mel Gorman
SUSE Labs

* Re: kernel BUG at mm/vmscan.c:1114
  2011-08-02 14:24     ` Mel Gorman
@ 2011-08-02 17:15       ` Andrew Morton
  -1 siblings, 0 replies; 27+ messages in thread
From: Andrew Morton @ 2011-08-02 17:15 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Xiaotian Feng, linux-mm, linux-kernel

On Tue, 2 Aug 2011 15:24:59 +0100 Mel Gorman <mgorman@suse.de> wrote:

> On Tue, Aug 02, 2011 at 03:09:57PM +0800, Xiaotian Feng wrote:
> > Hi,
> >    I'm hitting the kernel BUG at mm/vmscan.c:1114 twice, each time I
> > was trying to build my kernel. The photo of crash screen and my config
> > is attached. Thanks.
> > Regards
> > Xiaotian
> 
> I am obviously blind because in 3.0, I cannot see what BUG is at
> mm/vmscan.c:1114 :(. I see
> 
> 1109:			/*
> 1110:			 * If we don't have enough swap space, reclaiming of
> 1111:			 * anon page which don't already have a swap slot is
> 1112:			 * pointless.
> 1113:			 */
> 1114:			if (nr_swap_pages <= 0 && PageAnon(cursor_page) &&
> 1115:			    !PageSwapCache(cursor_page))
> 1116:				break;
> 1117:
> 1118:			if (__isolate_lru_page(cursor_page, mode, file) == 0) {
> 1119:				list_move(&cursor_page->lru, dst);
> 1120:				mem_cgroup_del_lru(cursor_page);
> 
> Is this 3.0 vanilla or are there some other patches applied?
> 

"3.0.0+": Current mainline.

static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
		struct list_head *src, struct list_head *dst,
		unsigned long *scanned, int order, int mode, int file)
{
	unsigned long nr_taken = 0;
	unsigned long nr_lumpy_taken = 0;
	unsigned long nr_lumpy_dirty = 0;
	unsigned long nr_lumpy_failed = 0;
	unsigned long scan;

	for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
		struct page *page;
		unsigned long pfn;
		unsigned long end_pfn;
		unsigned long page_pfn;
		int zone_id;

		page = lru_to_page(src);
		prefetchw_prev_lru_page(page, src, flags);

		VM_BUG_ON(!PageLRU(page));

		switch (__isolate_lru_page(page, mode, file)) {
		case 0:
			list_move(&page->lru, dst);
			mem_cgroup_del_lru(page);
			nr_taken += hpage_nr_pages(page);
			break;

		case -EBUSY:
			/* else it is being freed elsewhere */
			list_move(&page->lru, src);
			mem_cgroup_rotate_lru_list(page, page_lru(page));
			continue;

		default:
-->>			BUG();
		}
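
In other words, the only return values __isolate_lru_page() is
expected to produce on this path are 0 and -EBUSY: a page taken
straight off an LRU list has PageLRU set and has already passed the
eligibility filters, so -EINVAL should be impossible, and the default
arm asserts exactly that. A standalone sketch of the contract
(illustration only; fake_isolate_lru_page() is a toy model, not the
kernel function):

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Toy model of the return contract isolate_lru_pages() relies on. */
static int fake_isolate_lru_page(int page_on_lru, int busy)
{
	if (!page_on_lru)
		return -EINVAL;	/* ineligible: caller broke the contract */
	if (busy)
		return -EBUSY;	/* being freed elsewhere, retry later */
	return 0;		/* isolated */
}

int main(void)
{
	switch (fake_isolate_lru_page(1, 0)) {
	case 0:
		puts("isolated: move page to dst");
		break;
	case -EBUSY:
		puts("busy: rotate page back onto src");
		break;
	default:
		abort();	/* userspace stand-in for the BUG() above */
	}
	return 0;
}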


* Re: kernel BUG at mm/vmscan.c:1114
  2011-08-02  7:22     ` Andrew Morton
@ 2011-08-03  6:44       ` Xiaotian Feng
  -1 siblings, 0 replies; 27+ messages in thread
From: Xiaotian Feng @ 2011-08-03  6:44 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, mgorman

On Tue, Aug 2, 2011 at 3:22 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> On Tue, 2 Aug 2011 15:09:57 +0800 Xiaotian Feng <xtfeng@gmail.com> wrote:
>
>>    I'm hitting the kernel BUG at mm/vmscan.c:1114 twice, each time I
>> was trying to build my kernel. The photo of crash screen and my config
>> is attached.
>
> hm, now why has that started happening?
>
> Perhaps you could apply this debug patch, see if we can narrow it down?
>

I will try it then, but it isn't very reproducible :(
My system hung after some list corruption warnings, though... I've hit
the corruption 4 times...

Dozens of corruption warnings followed this one:
 [ 3641.495875] ------------[ cut here ]------------
 [ 3641.495885] WARNING: at lib/list_debug.c:53 __list_del_entry+0xa1/0xd0()
 [ 3641.495888] Hardware name: 42424XC
 [ 3641.495891] list_del corruption. prev->next should be
ffffea00000a6c20, but was ffff880033edde70
 [ 3641.495893] Modules linked in: ip6table_filter ip6_tables
ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4
xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp
iptable_filter ip_tables x_tables bridge stp binfmt_misc parport_pc
ppdev snd_hda_codec_conexant snd_hda_intel snd_hda_codec thinkpad_acpi
arc4 snd_hwdep snd_pcm snd_seq_midi snd_rawmidi cryptd aes_x86_64
iwlagn snd_seq_midi_event aes_generic snd_seq snd_timer snd_seq_device
mac80211 btusb bluetooth snd cfg80211 snd_page_alloc i915 uvcvideo
videodev drm_kms_helper psmouse v4l2_compat_ioctl32 drm tpm_tis tpm lp
soundcore tpm_bios nvram i2c_algo_bit serio_raw joydev parport video
usbhid hid ahci libahci firewire_ohci firewire_core crc_itu_t
sdhci_pci sdhci e1000e
 [ 3641.495987] Pid: 22709, comm: skype Tainted: G        W   3.0.0+ #23
 [ 3641.495989] Call Trace:
 [ 3641.495996]  [<ffffffff8106db3f>] warn_slowpath_common+0x7f/0xc0
 [ 3641.496001]  [<ffffffff8106dc36>] warn_slowpath_fmt+0x46/0x50
 [ 3641.496006]  [<ffffffff81332a71>] __list_del_entry+0xa1/0xd0
 [ 3641.496010]  [<ffffffff81332ab1>] list_del+0x11/0x40
 [ 3641.496015]  [<ffffffff8117a212>] __slab_free+0x362/0x3d0
 [ 3641.496020]  [<ffffffff81519ac9>] ? __sk_free+0xf9/0x1d0
 [ 3641.496025]  [<ffffffff8117b767>] ? kmem_cache_free+0x97/0x220
 [ 3641.496028]  [<ffffffff81519ac9>] ? __sk_free+0xf9/0x1d0
 [ 3641.496032]  [<ffffffff81519ac9>] ? __sk_free+0xf9/0x1d0
 [ 3641.496036]  [<ffffffff8117b8df>] kmem_cache_free+0x20f/0x220
 [ 3641.496040]  [<ffffffff81519ac9>] __sk_free+0xf9/0x1d0
 [ 3641.496044]  [<ffffffff81519c35>] sk_free+0x25/0x30
 [ 3641.496049]  [<ffffffff81576ac9>] tcp_close+0x239/0x440
 [ 3641.496054]  [<ffffffff815a10ef>] inet_release+0xcf/0x150
 [ 3641.496058]  [<ffffffff815a1042>] ? inet_release+0x22/0x150
 [ 3641.496063]  [<ffffffff81513f19>] sock_release+0x29/0x90
 [ 3641.496067]  [<ffffffff81513f97>] sock_close+0x17/0x30
 [ 3641.496072]  [<ffffffff8119151d>] fput+0xfd/0x240
 [ 3641.496077]  [<ffffffff8118c9b6>] filp_close+0x66/0x90
 [ 3641.496081]  [<ffffffff8118d412>] sys_close+0xc2/0x1a0
 [ 3641.496087]  [<ffffffff81652b60>] sysenter_dispatch+0x7/0x33
 [ 3641.496093]  [<ffffffff8132c8ae>] ? trace_hardirqs_on_thunk+0x3a/0x3f

After I rebooted my system and resumed building my kernel, the system
hung again, and I got the following warnings:

 [ 1220.468089] ------------[ cut here ]------------
 [ 1220.468099] WARNING: at lib/list_debug.c:56 __list_del_entry+0x82/0xd0()
 [ 1220.468102] Hardware name: 42424XC
 [ 1220.468104] list_del corruption. next->prev should be
ffffea0000e069a0, but was ffff880100216c78
 [ 1220.468106] Modules linked in: ip6table_filter ip6_tables
ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4
xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp
iptable_filter ip_tables x_tables binfmt_misc bridge stp parport_pc
ppdev snd_hda_codec_conexant snd_hda_intel snd_hda_codec thinkpad_acpi
snd_hwdep snd_pcm i915 snd_seq_midi snd_rawmidi arc4 cryptd
snd_seq_midi_event aes_x86_64 snd_seq drm_kms_helper iwlagn snd_timer
aes_generic drm snd_seq_device mac80211 psmouse uvcvideo videodev snd
v4l2_compat_ioctl32 soundcore snd_page_alloc serio_raw i2c_algo_bit
btusb tpm_tis tpm tpm_bios video cfg80211 bluetooth nvram lp joydev
parport usbhid hid ahci libahci firewire_ohci firewire_core e1000e
sdhci_pci sdhci crc_itu_t
 [ 1220.468185] Pid: 1168, comm: Xorg Tainted: G        W   3.0.0+ #23
 [ 1220.468188] Call Trace:
 [ 1220.468190]  <IRQ>  [<ffffffff8106db3f>] warn_slowpath_common+0x7f/0xc0
 [ 1220.468201]  [<ffffffff8106dc36>] warn_slowpath_fmt+0x46/0x50
 [ 1220.468206]  [<ffffffff81332a52>] __list_del_entry+0x82/0xd0
 [ 1220.468210]  [<ffffffff81332ab1>] list_del+0x11/0x40
 [ 1220.468216]  [<ffffffff8117a212>] __slab_free+0x362/0x3d0
 [ 1220.468222]  [<ffffffff811c6606>] ? bvec_free_bs+0x26/0x40
 [ 1220.468226]  [<ffffffff8117b767>] ? kmem_cache_free+0x97/0x220
 [ 1220.468230]  [<ffffffff811c6606>] ? bvec_free_bs+0x26/0x40
 [ 1220.468234]  [<ffffffff811c6606>] ? bvec_free_bs+0x26/0x40
 [ 1220.468239]  [<ffffffff8117b8df>] kmem_cache_free+0x20f/0x220
 [ 1220.468243]  [<ffffffff811c6606>] bvec_free_bs+0x26/0x40
 [ 1220.468247]  [<ffffffff811c6654>] bio_free+0x34/0x70
 [ 1220.468250]  [<ffffffff811c66a5>] bio_fs_de

So is it possible that my previous BUG was triggered by slab list corruption?
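
(For reference, what those warnings are checking: before unlinking an
entry, the list debugging code verifies that both neighbours still
point back at it. A minimal userspace sketch modelled on the checks in
lib/list_debug.c; checked_list_del() is a simplified stand-in, not the
kernel code:)

#include <stdio.h>

struct list_head {
	struct list_head *next, *prev;
};

/*
 * A stray write to a neighbour's link shows up as exactly the
 * "should be X, but was Y" warnings quoted above.
 */
static int checked_list_del(struct list_head *entry)
{
	if (entry->prev->next != entry) {
		printf("list_del corruption. prev->next should be %p, but was %p\n",
		       (void *)entry, (void *)entry->prev->next);
		return -1;
	}
	if (entry->next->prev != entry) {
		printf("list_del corruption. next->prev should be %p, but was %p\n",
		       (void *)entry, (void *)entry->next->prev);
		return -1;
	}
	entry->prev->next = entry->next;	/* normal unlink */
	entry->next->prev = entry->prev;
	return 0;
}

int main(void)
{
	struct list_head a, b, c;

	/* circular list: a <-> b <-> c <-> a */
	a.next = &b; b.next = &c; c.next = &a;
	a.prev = &c; b.prev = &a; c.prev = &b;

	c.prev = &a;		/* simulate a stray write */
	return checked_list_del(&b) ? 1 : 0;
}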

* Re: kernel BUG at mm/vmscan.c:1114
  2011-08-02 14:24     ` Mel Gorman
@ 2011-08-03  6:45       ` Xiaotian Feng
  -1 siblings, 0 replies; 27+ messages in thread
From: Xiaotian Feng @ 2011-08-03  6:45 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm, linux-kernel, Andrew Morton

On Tue, Aug 2, 2011 at 10:24 PM, Mel Gorman <mgorman@suse.de> wrote:
> On Tue, Aug 02, 2011 at 03:09:57PM +0800, Xiaotian Feng wrote:
>> Hi,
>>    I'm hitting the kernel BUG at mm/vmscan.c:1114 twice, each time I
>> was trying to build my kernel. The photo of crash screen and my config
>> is attached. Thanks.
>> Regards
>> Xiaotian
>
> I am obviously blind because in 3.0, I cannot see what BUG is at
> mm/vmscan.c:1114 :(. I see
>
> 1109:                   /*
> 1110:                    * If we don't have enough swap space, reclaiming of
> 1111:                    * anon page which don't already have a swap slot is
> 1112:                    * pointless.
> 1113:                    */
> 1114:                   if (nr_swap_pages <= 0 && PageAnon(cursor_page) &&
> 1115:                       !PageSwapCache(cursor_page))
> 1116:                           break;
> 1117:
> 1118:                   if (__isolate_lru_page(cursor_page, mode, file) == 0) {
> 1119:                           list_move(&cursor_page->lru, dst);
> 1120:                           mem_cgroup_del_lru(cursor_page);
>
> Is this 3.0 vanilla or are there some other patches applied?

No, I'm using a freshly cloned upstream kernel, without any changes. Thanks.

>
> --
> Mel Gorman
> SUSE Labs
>

* Re: kernel BUG at mm/vmscan.c:1114
  2011-08-03  6:44       ` Xiaotian Feng
@ 2011-08-03  8:54         ` Mel Gorman
  -1 siblings, 0 replies; 27+ messages in thread
From: Mel Gorman @ 2011-08-03  8:54 UTC (permalink / raw)
  To: Xiaotian Feng; +Cc: Andrew Morton, linux-mm, linux-kernel

On Wed, Aug 03, 2011 at 02:44:20PM +0800, Xiaotian Feng wrote:
> On Tue, Aug 2, 2011 at 3:22 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> > On Tue, 2 Aug 2011 15:09:57 +0800 Xiaotian Feng <xtfeng@gmail.com> wrote:
> >
> >>    I'm hitting the kernel BUG at mm/vmscan.c:1114 twice, each time I
> >> was trying to build my kernel. The photo of crash screen and my config
> >> is attached.
> >
> > hm, now why has that started happening?
> >
> > Perhaps you could apply this debug patch, see if we can narrow it down?
> >
> 
> I will try it then, but it isn't very reproducible :(
> But my system hung after some list corruption warnings... I hit the
> corruption 4 times...
> 

That is very unexpected, but if lists are being corrupted it could
explain the previously reported bug, as that bug looked like an active
page on an inactive list.

What was the last working kernel? Can you bisect?

>  [ 1220.468089] ------------[ cut here ]------------
>  [ 1220.468099] WARNING: at lib/list_debug.c:56 __list_del_entry+0x82/0xd0()
>  [ 1220.468102] Hardware name: 42424XC
>  [ 1220.468104] list_del corruption. next->prev should be
> ffffea0000e069a0, but was ffff880100216c78
>  [ 1220.468106] Modules linked in: ip6table_filter ip6_tables
> ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4
> xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp
> iptable_filter ip_tables x_tables binfmt_misc bridge stp parport_pc
> ppdev snd_hda_codec_conexant snd_hda_intel snd_hda_codec thinkpad_acpi
> snd_hwdep snd_pcm i915 snd_seq_midi snd_rawmidi arc4 cryptd
> snd_seq_midi_event aes_x86_64 snd_seq drm_kms_helper iwlagn snd_timer
> aes_generic drm snd_seq_device mac80211 psmouse uvcvideo videodev snd
> v4l2_compat_ioctl32 soundcore snd_page_alloc serio_raw i2c_algo_bit
> btusb tpm_tis tpm tpm_bios video cfg80211 bluetooth nvram lp joydev
> parport usbhid hid ahci libahci firewire_ohci firewire_core e1000e
> sdhci_pci sdhci crc_itu_t
>  [ 1220.468185] Pid: 1168, comm: Xorg Tainted: G        W   3.0.0+ #23
>  [ 1220.468188] Call Trace:
>  [ 1220.468190]  <IRQ>  [<ffffffff8106db3f>] warn_slowpath_common+0x7f/0xc0
>  [ 1220.468201]  [<ffffffff8106dc36>] warn_slowpath_fmt+0x46/0x50
>  [ 1220.468206]  [<ffffffff81332a52>] __list_del_entry+0x82/0xd0
>  [ 1220.468210]  [<ffffffff81332ab1>] list_del+0x11/0x40
>  [ 1220.468216]  [<ffffffff8117a212>] __slab_free+0x362/0x3d0
>  [ 1220.468222]  [<ffffffff811c6606>] ? bvec_free_bs+0x26/0x40
>  [ 1220.468226]  [<ffffffff8117b767>] ? kmem_cache_free+0x97/0x220
>  [ 1220.468230]  [<ffffffff811c6606>] ? bvec_free_bs+0x26/0x40
>  [ 1220.468234]  [<ffffffff811c6606>] ? bvec_free_bs+0x26/0x40
>  [ 1220.468239]  [<ffffffff8117b8df>] kmem_cache_free+0x20f/0x220
>  [ 1220.468243]  [<ffffffff811c6606>] bvec_free_bs+0x26/0x40
>  [ 1220.468247]  [<ffffffff811c6654>] bio_free+0x34/0x70
>  [ 1220.468250]  [<ffffffff811c66a5>] bio_fs_de
> 

This warning and the page reclaim warning are on paths that are
commonly used and I would expect to see multiple reports. I wonder
what is happening on your machine that is so unusual.

Have you run memtest on this machine for a few hours and badblocks
on the disk to ensure this is not hardware trouble?

> So is it possible that my previous BUG is triggered by slab list corruption?

Not directly, but clearly there is something very wrong.

If slub corruption reports are very common and kernel 3.0 is fine, my
strongest candidate for the corruption would be the SLUB lockless
patches. Try

git diff e4a46182e1bcc2ddacff5a35f6b52398b51f1b11..9e577e8b46ab0c38970c0f0cd7eae62e6dffddee | patch -p1 -R

They should revert cleanly with offsets.

-- 
Mel Gorman
SUSE Labs

* Re: kernel BUG at mm/vmscan.c:1114
  2011-08-03  8:54         ` Mel Gorman
@ 2011-08-03  9:02           ` Li Zefan
  -1 siblings, 0 replies; 27+ messages in thread
From: Li Zefan @ 2011-08-03  9:02 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Xiaotian Feng, Andrew Morton, linux-mm, linux-kernel

16:54, Mel Gorman wrote:
> On Wed, Aug 03, 2011 at 02:44:20PM +0800, Xiaotian Feng wrote:
>> On Tue, Aug 2, 2011 at 3:22 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
>>> On Tue, 2 Aug 2011 15:09:57 +0800 Xiaotian Feng <xtfeng@gmail.com> wrote:
>>>
>>>>    I'm hitting the kernel BUG at mm/vmscan.c:1114 twice, each time I
>>>> was trying to build my kernel. The photo of crash screen and my config
>>>> is attached.
>>>
>>> hm, now why has that started happening?
>>>
>>> Perhaps you could apply this debug patch, see if we can narrow it down?
>>>
>>
>> I will try it then, but it isn't very reproducible :(
>> But my system hung after some list corruption warnings... I hit the
>> corruption 4 times...
>>
> 
> That is very unexpected but if lists are being corrupted, it could
> explain the previously reported bug as that bug looked like an active
> page on an inactive list.
> 
> What was the last working kernel? Can you bisect?
> 

I just triggered the same BUG_ON() while running xfstests to test btrfs,
but I didn't note which test case was running when it happened; it was
around case 134.

--
Li Zefan

* Re: kernel BUG at mm/vmscan.c:1114
  2011-08-03  8:54         ` Mel Gorman
@ 2011-08-04  3:54           ` Xiaotian Feng
  -1 siblings, 0 replies; 27+ messages in thread
From: Xiaotian Feng @ 2011-08-04  3:54 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, linux-mm, linux-kernel

On Wed, Aug 3, 2011 at 4:54 PM, Mel Gorman <mgorman@suse.de> wrote:
> On Wed, Aug 03, 2011 at 02:44:20PM +0800, Xiaotian Feng wrote:
>> On Tue, Aug 2, 2011 at 3:22 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
>> > On Tue, 2 Aug 2011 15:09:57 +0800 Xiaotian Feng <xtfeng@gmail.com> wrote:
>> >
>> >>    I'm hitting the kernel BUG at mm/vmscan.c:1114 twice, each time I
>> >> was trying to build my kernel. The photo of crash screen and my config
>> >> is attached.
>> >
>> > hm, now why has that started happening?
>> >
>> > Perhaps you could apply this debug patch, see if we can narrow it down?
>> >
>>
>> I will try it then, but it isn't very reproducible :(
>> But my system hung after some list corruption warnings... I hit the
>> corruption 4 times...
>>
>
> That is very unexpected but if lists are being corrupted, it could
> explain the previously reported bug as that bug looked like an active
> page on an inactive list.
>
> What was the last working kernel? Can you bisect?
>
>>  [ 1220.468089] ------------[ cut here ]------------
>>  [ 1220.468099] WARNING: at lib/list_debug.c:56 __list_del_entry+0x82/0xd0()
>>  [ 1220.468102] Hardware name: 42424XC
>>  [ 1220.468104] list_del corruption. next->prev should be
>> ffffea0000e069a0, but was ffff880100216c78
>>  [ 1220.468106] Modules linked in: ip6table_filter ip6_tables
>> ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4
>> xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp
>> iptable_filter ip_tables x_tables binfmt_misc bridge stp parport_pc
>> ppdev snd_hda_codec_conexant snd_hda_intel snd_hda_codec thinkpad_acpi
>> snd_hwdep snd_pcm i915 snd_seq_midi snd_rawmidi arc4 cryptd
>> snd_seq_midi_event aes_x86_64 snd_seq drm_kms_helper iwlagn snd_timer
>> aes_generic drm snd_seq_device mac80211 psmouse uvcvideo videodev snd
>> v4l2_compat_ioctl32 soundcore snd_page_alloc serio_raw i2c_algo_bit
>> btusb tpm_tis tpm tpm_bios video cfg80211 bluetooth nvram lp joydev
>> parport usbhid hid ahci libahci firewire_ohci firewire_core e1000e
>> sdhci_pci sdhci crc_itu_t
>>  [ 1220.468185] Pid: 1168, comm: Xorg Tainted: G        W   3.0.0+ #23
>>  [ 1220.468188] Call Trace:
>>  [ 1220.468190]  <IRQ>  [<ffffffff8106db3f>] warn_slowpath_common+0x7f/0xc0
>>  [ 1220.468201]  [<ffffffff8106dc36>] warn_slowpath_fmt+0x46/0x50
>>  [ 1220.468206]  [<ffffffff81332a52>] __list_del_entry+0x82/0xd0
>>  [ 1220.468210]  [<ffffffff81332ab1>] list_del+0x11/0x40
>>  [ 1220.468216]  [<ffffffff8117a212>] __slab_free+0x362/0x3d0
>>  [ 1220.468222]  [<ffffffff811c6606>] ? bvec_free_bs+0x26/0x40
>>  [ 1220.468226]  [<ffffffff8117b767>] ? kmem_cache_free+0x97/0x220
>>  [ 1220.468230]  [<ffffffff811c6606>] ? bvec_free_bs+0x26/0x40
>>  [ 1220.468234]  [<ffffffff811c6606>] ? bvec_free_bs+0x26/0x40
>>  [ 1220.468239]  [<ffffffff8117b8df>] kmem_cache_free+0x20f/0x220
>>  [ 1220.468243]  [<ffffffff811c6606>] bvec_free_bs+0x26/0x40
>>  [ 1220.468247]  [<ffffffff811c6654>] bio_free+0x34/0x70
>>  [ 1220.468250]  [<ffffffff811c66a5>] bio_fs_de
>>
>

I'm hitting this again today while trying to rebuild my kernel....
Looking at it a bit:

 list_del corruption. next->prev should be ffffea0000e069a0, but was
ffff880100216c78

I found something interesting in my syslog:

 PERCPU: Embedded 28 pages/cpu @ffff880100200000 s83456 r8192 d23040 u262144
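
Spelling out the arithmetic left implicit here: the bogus pointer from
the warning falls inside that embedded per-cpu area, which would be
consistent with a stray per-cpu write clobbering a list link. A quick
standalone check (sketch only; 4K page size assumed):

#include <stdio.h>

int main(void)
{
	unsigned long percpu_base = 0xffff880100200000UL; /* "@ffff880100200000" */
	unsigned long bad_ptr     = 0xffff880100216c78UL; /* from the warning */
	unsigned long cpu0_bytes  = 28UL * 4096;	  /* "28 pages/cpu" */
	unsigned long off = bad_ptr - percpu_base;

	/* prints: offset 0x16c78 is inside cpu 0's embedded percpu pages */
	printf("offset %#lx is %s cpu 0's embedded percpu pages\n",
	       off, off < cpu0_bytes ? "inside" : "outside");
	return 0;
}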

> This warning and the page reclaim warning are on paths that are
> commonly used and I would expect to see multiple reports. I wonder
> what is happening on your machine that is so unusual.
>
> Have you run memtest on this machine for a few hours and badblocks
> on the disk to ensure this is not hardware trouble?
>
>> So is it possible that my previous BUG is triggered by slab list corruption?
>
> Not directly, but clearly there is something very wrong.
>
> If slub corruption reports are very common and kernel 3.0 is fine, my
> strongest candidate for the corruption would be the SLUB lockless
> patches. Try
>
> git diff e4a46182e1bcc2ddacff5a35f6b52398b51f1b11..9e577e8b46ab0c38970c0f0cd7eae62e6dffddee | patch -p1 -R
>

I will try it now, thanks.

> They should revert cleanly with offsets.
>
> --
> Mel Gorman
> SUSE Labs
>

* Re: kernel BUG at mm/vmscan.c:1114
  2011-08-04  3:54           ` Xiaotian Feng
@ 2011-08-05  8:42             ` Xiaotian Feng
  -1 siblings, 0 replies; 27+ messages in thread
From: Xiaotian Feng @ 2011-08-05  8:42 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, linux-mm, linux-kernel

On Thu, Aug 4, 2011 at 11:54 AM, Xiaotian Feng <xtfeng@gmail.com> wrote:
> On Wed, Aug 3, 2011 at 4:54 PM, Mel Gorman <mgorman@suse.de> wrote:
>> On Wed, Aug 03, 2011 at 02:44:20PM +0800, Xiaotian Feng wrote:
>>> On Tue, Aug 2, 2011 at 3:22 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
>>> > On Tue, 2 Aug 2011 15:09:57 +0800 Xiaotian Feng <xtfeng@gmail.com> wrote:
>>> >
>>> >>    I'm hitting the kernel BUG at mm/vmscan.c:1114 twice, each time I
>>> >> was trying to build my kernel. The photo of crash screen and my config
>>> >> is attached.
>>> >
>>> > hm, now why has that started happening?
>>> >
>>> > Perhaps you could apply this debug patch, see if we can narrow it down?
>>> >
>>>
>>> I will try it then, but it isn't very reproducible :(
>>> But my system hung after some list corruption warnings... I hit the
>>> corruption 4 times...
>>>
>>
>> That is very unexpected but if lists are being corrupted, it could
>> explain the previously reported bug as that bug looked like an active
>> page on an inactive list.
>>
>> What was the last working kernel? Can you bisect?
>>
>>>  [ 1220.468089] ------------[ cut here ]------------
>>>  [ 1220.468099] WARNING: at lib/list_debug.c:56 __list_del_entry+0x82/0xd0()
>>>  [ 1220.468102] Hardware name: 42424XC
>>>  [ 1220.468104] list_del corruption. next->prev should be
>>> ffffea0000e069a0, but was ffff880100216c78
>>>  [ 1220.468106] Modules linked in: ip6table_filter ip6_tables
>>> ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4
>>> xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp
>>> iptable_filter ip_tables x_tables binfmt_misc bridge stp parport_pc
>>> ppdev snd_hda_codec_conexant snd_hda_intel snd_hda_codec thinkpad_acpi
>>> snd_hwdep snd_pcm i915 snd_seq_midi snd_rawmidi arc4 cryptd
>>> snd_seq_midi_event aes_x86_64 snd_seq drm_kms_helper iwlagn snd_timer
>>> aes_generic drm snd_seq_device mac80211 psmouse uvcvideo videodev snd
>>> v4l2_compat_ioctl32 soundcore snd_page_alloc serio_raw i2c_algo_bit
>>> btusb tpm_tis tpm tpm_bios video cfg80211 bluetooth nvram lp joydev
>>> parport usbhid hid ahci libahci firewire_ohci firewire_core e1000e
>>> sdhci_pci sdhci crc_itu_t
>>>  [ 1220.468185] Pid: 1168, comm: Xorg Tainted: G        W   3.0.0+ #23
>>>  [ 1220.468188] Call Trace:
>>>  [ 1220.468190]  <IRQ>  [<ffffffff8106db3f>] warn_slowpath_common+0x7f/0xc0
>>>  [ 1220.468201]  [<ffffffff8106dc36>] warn_slowpath_fmt+0x46/0x50
>>>  [ 1220.468206]  [<ffffffff81332a52>] __list_del_entry+0x82/0xd0
>>>  [ 1220.468210]  [<ffffffff81332ab1>] list_del+0x11/0x40
>>>  [ 1220.468216]  [<ffffffff8117a212>] __slab_free+0x362/0x3d0
>>>  [ 1220.468222]  [<ffffffff811c6606>] ? bvec_free_bs+0x26/0x40
>>>  [ 1220.468226]  [<ffffffff8117b767>] ? kmem_cache_free+0x97/0x220
>>>  [ 1220.468230]  [<ffffffff811c6606>] ? bvec_free_bs+0x26/0x40
>>>  [ 1220.468234]  [<ffffffff811c6606>] ? bvec_free_bs+0x26/0x40
>>>  [ 1220.468239]  [<ffffffff8117b8df>] kmem_cache_free+0x20f/0x220
>>>  [ 1220.468243]  [<ffffffff811c6606>] bvec_free_bs+0x26/0x40
>>>  [ 1220.468247]  [<ffffffff811c6654>] bio_free+0x34/0x70
>>>  [ 1220.468250]  [<ffffffff811c66a5>] bio_fs_de
>>>
>>
>
> I'm hitting this again today, when I'm trying to rebuild my kernel....
> Looking it a bit
>
>  list_del corruption. next->prev should be ffffea0000e069a0, but was
> ffff880100216c78
>
> I find something interesting from my syslog:
>
>  PERCPU: Embedded 28 pages/cpu @ffff880100200000 s83456 r8192 d23040 u262144
>
>> This warning and the page reclaim warning are on paths that are
>> commonly used and I would expect to see multiple reports. I wonder
>> what is happening on your machine that is so unusual.
>>
>> Have you run memtest on this machine for a few hours and badblocks
>> on the disk to ensure this is not hardware trouble?
>>
>>> So is it possible that my previous BUG is triggered by slab list corruption?
>>
>> Not directly, but clearly there is something very wrong.
>>
>> If slub corruption reports are very common and kernel 3.0 is fine, my
>> strongest candidate for the corruption would be the SLUB lockless
>> patches. Try
>>
>> git diff e4a46182e1bcc2ddacff5a35f6b52398b51f1b11..9e577e8b46ab0c38970c0f0cd7eae62e6dffddee | patch -p1 -R
>>
>

Here's an update on the results:

3.0.0-rc7: ran for hours without a crash
upstream kernel: list corruption hit within 10 mins of building the
kernel (I'm running some apps, chrome/firefox/thunderbird/..., as well)
upstream kernel with the above revert: ran for hours without a crash

Trying to bisect, but rebuilds are slow ....

> I will try it now, thanks.
>
>> They should revert cleanly with offsets.
>>
>> --
>> Mel Gorman
>> SUSE Labs
>>
>

* Re: kernel BUG at mm/vmscan.c:1114
  2011-08-05  8:42             ` Xiaotian Feng
@ 2011-08-05  9:19               ` Mel Gorman
  -1 siblings, 0 replies; 27+ messages in thread
From: Mel Gorman @ 2011-08-05  9:19 UTC (permalink / raw)
  To: Xiaotian Feng
  Cc: Andrew Morton, linux-mm, linux-kernel, Pekka Enberg, Christoph Lameter

(Adding patch author to cc)

On Fri, Aug 05, 2011 at 04:42:43PM +0800, Xiaotian Feng wrote:
> On Thu, Aug 4, 2011 at 11:54 AM, Xiaotian Feng <xtfeng@gmail.com> wrote:
> > On Wed, Aug 3, 2011 at 4:54 PM, Mel Gorman <mgorman@suse.de> wrote:
> >> On Wed, Aug 03, 2011 at 02:44:20PM +0800, Xiaotian Feng wrote:
> >>> On Tue, Aug 2, 2011 at 3:22 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> >>> > On Tue, 2 Aug 2011 15:09:57 +0800 Xiaotian Feng <xtfeng@gmail.com> wrote:
> >>> >
> >>> >> __ __I'm hitting the kernel BUG at mm/vmscan.c:1114 twice, each time I
> >>> >> was trying to build my kernel. The photo of crash screen and my config
> >>> >> is attached.
> >>> >
> >>> > hm, now why has that started happening?
> >>> >
> >>> > Perhaps you could apply this debug patch, see if we can narrow it down?
> >>> >
> >>>
> >>> I will try it then, but it isn't very reproducible :(
> >>> But my system hung after some list corruption warnings... I hit the
> >>> corruption 4 times...
> >>>
> >>
> >> That is very unexpected but if lists are being corrupted, it could
> >> explain the previously reported bug as that bug looked like an active
> >> page on an inactive list.
> >>
> >> What was the last working kernel? Can you bisect?
> >>
> >>>  [ 1220.468089] ------------[ cut here ]------------
> >>>  [ 1220.468099] WARNING: at lib/list_debug.c:56 __list_del_entry+0x82/0xd0()
> >>>  [ 1220.468102] Hardware name: 42424XC
> >>>  [ 1220.468104] list_del corruption. next->prev should be
> >>> ffffea0000e069a0, but was ffff880100216c78
> >>>  [ 1220.468106] Modules linked in: ip6table_filter ip6_tables
> >>> ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4
> >>> xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp
> >>> iptable_filter ip_tables x_tables binfmt_misc bridge stp parport_pc
> >>> ppdev snd_hda_codec_conexant snd_hda_intel snd_hda_codec thinkpad_acpi
> >>> snd_hwdep snd_pcm i915 snd_seq_midi snd_rawmidi arc4 cryptd
> >>> snd_seq_midi_event aes_x86_64 snd_seq drm_kms_helper iwlagn snd_timer
> >>> aes_generic drm snd_seq_device mac80211 psmouse uvcvideo videodev snd
> >>> v4l2_compat_ioctl32 soundcore snd_page_alloc serio_raw i2c_algo_bit
> >>> btusb tpm_tis tpm tpm_bios video cfg80211 bluetooth nvram lp joydev
> >>> parport usbhid hid ahci libahci firewire_ohci firewire_core e1000e
> >>> sdhci_pci sdhci crc_itu_t
> >>>  [ 1220.468185] Pid: 1168, comm: Xorg Tainted: G        W   3.0.0+ #23
> >>>  [ 1220.468188] Call Trace:
> >>>  [ 1220.468190]  <IRQ>  [<ffffffff8106db3f>] warn_slowpath_common+0x7f/0xc0
> >>>  [ 1220.468201]  [<ffffffff8106dc36>] warn_slowpath_fmt+0x46/0x50
> >>>  [ 1220.468206]  [<ffffffff81332a52>] __list_del_entry+0x82/0xd0
> >>>  [ 1220.468210]  [<ffffffff81332ab1>] list_del+0x11/0x40
> >>>  [ 1220.468216]  [<ffffffff8117a212>] __slab_free+0x362/0x3d0
> >>>  [ 1220.468222]  [<ffffffff811c6606>] ? bvec_free_bs+0x26/0x40
> >>>  [ 1220.468226]  [<ffffffff8117b767>] ? kmem_cache_free+0x97/0x220
> >>>  [ 1220.468230]  [<ffffffff811c6606>] ? bvec_free_bs+0x26/0x40
> >>>  [ 1220.468234]  [<ffffffff811c6606>] ? bvec_free_bs+0x26/0x40
> >>>  [ 1220.468239]  [<ffffffff8117b8df>] kmem_cache_free+0x20f/0x220
> >>>  [ 1220.468243]  [<ffffffff811c6606>] bvec_free_bs+0x26/0x40
> >>>  [ 1220.468247]  [<ffffffff811c6654>] bio_free+0x34/0x70
> >>>  [ 1220.468250]  [<ffffffff811c66a5>] bio_fs_de
> >>>
> >>
> >
> > I'm hitting this again today, when I'm trying to rebuild my kernel....
> > Looking it a bit
> >
> >  list_del corruption. next->prev should be ffffea0000e069a0, but was
> > ffff880100216c78
> >
> > I find something interesting from my syslog:
> >
> >  PERCPU: Embedded 28 pages/cpu @ffff880100200000 s83456 r8192 d23040 u262144
> >
> >> This warning and the page reclaim warning are on paths that are
> >> commonly used and I would expect to see multiple reports. I wonder
> >> what is happening on your machine that is so unusual.
> >>
> >> Have you run memtest on this machine for a few hours and badblocks
> >> on the disk to ensure this is not hardware trouble?
> >>
> >>> So is it possible that my previous BUG is triggered by slab list corruption?
> >>
> >> Not directly, but clearly there is something very wrong.
> >>
> >> If slub corruption reports are very common and kernel 3.0 is fine, my
> >> strongest candidate for the corruption would be the SLUB lockless
> >> patches. Try
> >>
> >> git diff e4a46182e1bcc2ddacff5a35f6b52398b51f1b11..9e577e8b46ab0c38970c0f0cd7eae62e6dffddee | patch -p1 -R
> >>
> >
> 
> Here's a update for the results:
> 
> 3.0.0-rc7: running for hours without a crash
> upstream kernel: list corruption happened while building kernel within
> 10 mins (I'm running some app chrome/firefox/thunderbird/... as well)
> upstream kernel with above revert: running for hours without a crash
> 
> Trying to bisect but rebuild is slow ....
> 

If you have not done so already, I strongly suggest your bisection
starts within that range of patches to isolate which one is at fault.
It'll cut down on the number of builds you need to do. Thanks for
testing.
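
For example, something like the following would confine the search
(illustrative only; it assumes the newer end of the range quoted above
is the bad end and the older end is good, which is what the revert
result suggests):

  git bisect start 9e577e8b46ab0c38970c0f0cd7eae62e6dffddee e4a46182e1bcc2ddacff5a35f6b52398b51f1b11
  # rebuild and test at each step, then mark it:
  git bisect good    # or: git bisect bad

Note that git bisect start takes the bad commit first, then the good
one.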

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: kernel BUG at mm/vmscan.c:1114
  2011-08-05  9:19               ` Mel Gorman
@ 2011-08-05 12:09                 ` Xiaotian Feng
  -1 siblings, 0 replies; 27+ messages in thread
From: Xiaotian Feng @ 2011-08-05 12:09 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, linux-mm, linux-kernel, Pekka Enberg, Christoph Lameter

On Fri, Aug 5, 2011 at 5:19 PM, Mel Gorman <mgorman@suse.de> wrote:
> (Adding patch author to cc)
>
> On Fri, Aug 05, 2011 at 04:42:43PM +0800, Xiaotian Feng wrote:
>> On Thu, Aug 4, 2011 at 11:54 AM, Xiaotian Feng <xtfeng@gmail.com> wrote:
>> > On Wed, Aug 3, 2011 at 4:54 PM, Mel Gorman <mgorman@suse.de> wrote:
>> >> On Wed, Aug 03, 2011 at 02:44:20PM +0800, Xiaotian Feng wrote:
>> >>> On Tue, Aug 2, 2011 at 3:22 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
>> >>> > On Tue, 2 Aug 2011 15:09:57 +0800 Xiaotian Feng <xtfeng@gmail.com> wrote:
>> >>> >
>> >>> >> __ __I'm hitting the kernel BUG at mm/vmscan.c:1114 twice, each time I
>> >>> >> was trying to build my kernel. The photo of crash screen and my config
>> >>> >> is attached.
>> >>> >
>> >>> > hm, now why has that started happening?
>> >>> >
>> >>> > Perhaps you could apply this debug patch, see if we can narrow it down?
>> >>> >
>> >>>
>> >>> I will try it then, but it isn't very reproducible :(
>> >>> But my system hung after some list corruption warnings... I hit the
>> >>> corruption 4 times...
>> >>>
>> >>
>> >> That is very unexpected but if lists are being corrupted, it could
>> >> explain the previously reported bug as that bug looked like an active
>> >> page on an inactive list.
>> >>
>> >> What was the last working kernel? Can you bisect?
>> >>
>> >>>  [ 1220.468089] ------------[ cut here ]------------
>> >>>  [ 1220.468099] WARNING: at lib/list_debug.c:56 __list_del_entry+0x82/0xd0()
>> >>>  [ 1220.468102] Hardware name: 42424XC
>> >>>  [ 1220.468104] list_del corruption. next->prev should be
>> >>> ffffea0000e069a0, but was ffff880100216c78
>> >>>  [ 1220.468106] Modules linked in: ip6table_filter ip6_tables
>> >>> ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4
>> >>> xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp
>> >>> iptable_filter ip_tables x_tables binfmt_misc bridge stp parport_pc
>> >>> ppdev snd_hda_codec_conexant snd_hda_intel snd_hda_codec thinkpad_acpi
>> >>> snd_hwdep snd_pcm i915 snd_seq_midi snd_rawmidi arc4 cryptd
>> >>> snd_seq_midi_event aes_x86_64 snd_seq drm_kms_helper iwlagn snd_timer
>> >>> aes_generic drm snd_seq_device mac80211 psmouse uvcvideo videodev snd
>> >>> v4l2_compat_ioctl32 soundcore snd_page_alloc serio_raw i2c_algo_bit
>> >>> btusb tpm_tis tpm tpm_bios video cfg80211 bluetooth nvram lp joydev
>> >>> parport usbhid hid ahci libahci firewire_ohci firewire_core e1000e
>> >>> sdhci_pci sdhci crc_itu_t
>> >>>  [ 1220.468185] Pid: 1168, comm: Xorg Tainted: G        W   3.0.0+ #23
>> >>>  [ 1220.468188] Call Trace:
>> >>>  [ 1220.468190]  <IRQ>  [<ffffffff8106db3f>] warn_slowpath_common+0x7f/0xc0
>> >>>  [ 1220.468201]  [<ffffffff8106dc36>] warn_slowpath_fmt+0x46/0x50
>> >>>  [ 1220.468206]  [<ffffffff81332a52>] __list_del_entry+0x82/0xd0
>> >>>  [ 1220.468210]  [<ffffffff81332ab1>] list_del+0x11/0x40
>> >>>  [ 1220.468216]  [<ffffffff8117a212>] __slab_free+0x362/0x3d0
>> >>>  [ 1220.468222]  [<ffffffff811c6606>] ? bvec_free_bs+0x26/0x40
>> >>>  [ 1220.468226]  [<ffffffff8117b767>] ? kmem_cache_free+0x97/0x220
>> >>>  [ 1220.468230]  [<ffffffff811c6606>] ? bvec_free_bs+0x26/0x40
>> >>>  [ 1220.468234]  [<ffffffff811c6606>] ? bvec_free_bs+0x26/0x40
>> >>>  [ 1220.468239]  [<ffffffff8117b8df>] kmem_cache_free+0x20f/0x220
>> >>>  [ 1220.468243]  [<ffffffff811c6606>] bvec_free_bs+0x26/0x40
>> >>>  [ 1220.468247]  [<ffffffff811c6654>] bio_free+0x34/0x70
>> >>>  [ 1220.468250]  [<ffffffff811c66a5>] bio_fs_de
>> >>>
>> >>
>> >
>> > I'm hitting this again today, when I'm trying to rebuild my kernel....
>> > Looking it a bit
>> >
>> >  list_del corruption. next->prev should be ffffea0000e069a0, but was
>> > ffff880100216c78
>> >
>> > I find something interesting from my syslog:
>> >
>> >  PERCPU: Embedded 28 pages/cpu @ffff880100200000 s83456 r8192 d23040 u262144
>> >
>> >> This warning and the page reclaim warning are on paths that are
>> >> commonly used and I would expect to see multiple reports. I wonder
>> >> what is happening on your machine that is so unusual.
>> >>
>> >> Have you run memtest on this machine for a few hours and badblocks
>> >> on the disk to ensure this is not hardware trouble?
>> >>
>> >>> So is it possible that my previous BUG is triggered by slab list corruption?
>> >>
>> >> Not directly, but clearly there is something very wrong.
>> >>
>> >> If slub corruption reports are very common and kernel 3.0 is fine, my
>> >> strongest candidate for the corruption would be the SLUB lockless
>> >> patches. Try
>> >>
>> >> git diff e4a46182e1bcc2ddacff5a35f6b52398b51f1b11..9e577e8b46ab0c38970c0f0cd7eae62e6dffddee | patch -p1 -R
>> >>
>> >
>>
>> Here's a update for the results:
>>
>> 3.0.0-rc7: running for hours without a crash
>> upstream kernel: list corruption happened while building kernel within
>> 10 mins (I'm running some app chrome/firefox/thunderbird/... as well)
>> upstream kernel with above revert: running for hours without a crash
>>
>> Trying to bisect but rebuild is slow ....
>>
>
> If you have not done so already, I strongly suggest your bisection
> starts within that range of patches to isolate which one is at fault.
> It'll cut down on the number of builds you need to do. Thanks for
> testing.
>

This is interesting; I just made the following change:

diff --git a/mm/slub.c b/mm/slub.c
index eb5a8f9..616b78e 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2104,8 +2104,9 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
                        "__slab_alloc"));

        if (unlikely(!object)) {
-               c->page = NULL;
+               //c->page = NULL;
                stat(s, DEACTIVATE_BYPASS);
+               deactivate_slab(s, c);
                goto new_slab;
        }

With that change my system doesn't print any list corruption warnings
and my build succeeds. So this means a revert of 03e404af2 could cure
this. I'll do more testing next week to see if the list corruption
still exists, thanks.
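
(Illustrative aside: assuming the short hash resolves in this tree, the
same hypothesis could also be tested with a plain revert instead of the
hand edit above, e.g. "git revert 03e404af2", followed by a rebuild.)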



> --
> Mel Gorman
> SUSE Labs
>

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: kernel BUG at mm/vmscan.c:1114
  2011-08-05 12:09                 ` Xiaotian Feng
@ 2011-08-05 12:30                   ` Xiaotian Feng
  -1 siblings, 0 replies; 27+ messages in thread
From: Xiaotian Feng @ 2011-08-05 12:30 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, linux-mm, linux-kernel, Pekka Enberg, Christoph Lameter

On Fri, Aug 5, 2011 at 8:09 PM, Xiaotian Feng <xtfeng@gmail.com> wrote:
> On Fri, Aug 5, 2011 at 5:19 PM, Mel Gorman <mgorman@suse.de> wrote:
>> (Adding patch author to cc)
>>
>> On Fri, Aug 05, 2011 at 04:42:43PM +0800, Xiaotian Feng wrote:
>>> On Thu, Aug 4, 2011 at 11:54 AM, Xiaotian Feng <xtfeng@gmail.com> wrote:
>>> > On Wed, Aug 3, 2011 at 4:54 PM, Mel Gorman <mgorman@suse.de> wrote:
>>> >> On Wed, Aug 03, 2011 at 02:44:20PM +0800, Xiaotian Feng wrote:
>>> >>> On Tue, Aug 2, 2011 at 3:22 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
>>> >>> > On Tue, 2 Aug 2011 15:09:57 +0800 Xiaotian Feng <xtfeng@gmail.com> wrote:
>>> >>> >
>>> >>> >> __ __I'm hitting the kernel BUG at mm/vmscan.c:1114 twice, each time I
>>> >>> >> was trying to build my kernel. The photo of crash screen and my config
>>> >>> >> is attached.
>>> >>> >
>>> >>> > hm, now why has that started happening?
>>> >>> >
>>> >>> > Perhaps you could apply this debug patch, see if we can narrow it down?
>>> >>> >
>>> >>>
>>> >>> I will try it then, but it isn't very reproducible :(
>>> >>> But my system hung after some list corruption warnings... I hit the
>>> >>> corruption 4 times...
>>> >>>
>>> >>
>>> >> That is very unexpected but if lists are being corrupted, it could
>>> >> explain the previously reported bug as that bug looked like an active
>>> >> page on an inactive list.
>>> >>
>>> >> What was the last working kernel? Can you bisect?
>>> >>
>>> >>>  [ 1220.468089] ------------[ cut here ]------------
>>> >>>  [ 1220.468099] WARNING: at lib/list_debug.c:56 __list_del_entry+0x82/0xd0()
>>> >>>  [ 1220.468102] Hardware name: 42424XC
>>> >>>  [ 1220.468104] list_del corruption. next->prev should be
>>> >>> ffffea0000e069a0, but was ffff880100216c78
>>> >>>  [ 1220.468106] Modules linked in: ip6table_filter ip6_tables
>>> >>> ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4
>>> >>> xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp
>>> >>> iptable_filter ip_tables x_tables binfmt_misc bridge stp parport_pc
>>> >>> ppdev snd_hda_codec_conexant snd_hda_intel snd_hda_codec thinkpad_acpi
>>> >>> snd_hwdep snd_pcm i915 snd_seq_midi snd_rawmidi arc4 cryptd
>>> >>> snd_seq_midi_event aes_x86_64 snd_seq drm_kms_helper iwlagn snd_timer
>>> >>> aes_generic drm snd_seq_device mac80211 psmouse uvcvideo videodev snd
>>> >>> v4l2_compat_ioctl32 soundcore snd_page_alloc serio_raw i2c_algo_bit
>>> >>> btusb tpm_tis tpm tpm_bios video cfg80211 bluetooth nvram lp joydev
>>> >>> parport usbhid hid ahci libahci firewire_ohci firewire_core e1000e
>>> >>> sdhci_pci sdhci crc_itu_t
>>> >>>  [ 1220.468185] Pid: 1168, comm: Xorg Tainted: G        W   3.0.0+ #23
>>> >>>  [ 1220.468188] Call Trace:
>>> >>>  [ 1220.468190]  <IRQ>  [<ffffffff8106db3f>] warn_slowpath_common+0x7f/0xc0
>>> >>>  [ 1220.468201]  [<ffffffff8106dc36>] warn_slowpath_fmt+0x46/0x50
>>> >>>  [ 1220.468206]  [<ffffffff81332a52>] __list_del_entry+0x82/0xd0
>>> >>>  [ 1220.468210]  [<ffffffff81332ab1>] list_del+0x11/0x40
>>> >>>  [ 1220.468216]  [<ffffffff8117a212>] __slab_free+0x362/0x3d0
>>> >>>  [ 1220.468222]  [<ffffffff811c6606>] ? bvec_free_bs+0x26/0x40
>>> >>>  [ 1220.468226]  [<ffffffff8117b767>] ? kmem_cache_free+0x97/0x220
>>> >>>  [ 1220.468230]  [<ffffffff811c6606>] ? bvec_free_bs+0x26/0x40
>>> >>>  [ 1220.468234]  [<ffffffff811c6606>] ? bvec_free_bs+0x26/0x40
>>> >>>  [ 1220.468239]  [<ffffffff8117b8df>] kmem_cache_free+0x20f/0x220
>>> >>>  [ 1220.468243]  [<ffffffff811c6606>] bvec_free_bs+0x26/0x40
>>> >>>  [ 1220.468247]  [<ffffffff811c6654>] bio_free+0x34/0x70
>>> >>>  [ 1220.468250]  [<ffffffff811c66a5>] bio_fs_de
>>> >>>
>>> >>
>>> >
>>> > I'm hitting this again today, when I'm trying to rebuild my kernel....
>>> > Looking it a bit
>>> >
>>> >  list_del corruption. next->prev should be ffffea0000e069a0, but was
>>> > ffff880100216c78
>>> >
>>> > I find something interesting from my syslog:
>>> >
>>> >  PERCPU: Embedded 28 pages/cpu @ffff880100200000 s83456 r8192 d23040 u262144
>>> >
>>> >> This warning and the page reclaim warning are on paths that are
>>> >> commonly used and I would expect to see multiple reports. I wonder
>>> >> what is happening on your machine that is so unusual.
>>> >>
>>> >> Have you run memtest on this machine for a few hours and badblocks
>>> >> on the disk to ensure this is not hardware trouble?
>>> >>
>>> >>> So is it possible that my previous BUG is triggered by slab list corruption?
>>> >>
>>> >> Not directly, but clearly there is something very wrong.
>>> >>
>>> >> If slub corruption reports are very common and kernel 3.0 is fine, my
>>> >> strongest candidate for the corruption would be the SLUB lockless
>>> >> patches. Try
>>> >>
>>> >> git diff e4a46182e1bcc2ddacff5a35f6b52398b51f1b11..9e577e8b46ab0c38970c0f0cd7eae62e6dffddee | patch -p1 -R
>>> >>
>>> >
>>>
>>> Here's a update for the results:
>>>
>>> 3.0.0-rc7: running for hours without a crash
>>> upstream kernel: list corruption happened while building kernel within
>>> 10 mins (I'm running some app chrome/firefox/thunderbird/... as well)
>>> upstream kernel with above revert: running for hours without a crash
>>>
>>> Trying to bisect but rebuild is slow ....
>>>
>>
>> If you have not done so already, I strongly suggest your bisection
>> starts within that range of patches to isolate which one is at fault.
>> It'll cut down on the number of builds you need to do. Thanks for
>> testing.
>>
>
> This is interesting; I just made the following change:
>
> diff --git a/mm/slub.c b/mm/slub.c
> index eb5a8f9..616b78e 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2104,8 +2104,9 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>                        "__slab_alloc"));
>
>        if (unlikely(!object)) {
> -               c->page = NULL;
> +               //c->page = NULL;
>                stat(s, DEACTIVATE_BYPASS);
> +               deactivate_slab(s, c);
>                goto new_slab;
>        }
>
> With that change my system doesn't print any list corruption warnings
> and my build succeeds. So this means a revert of 03e404af2 could cure
> this. I'll do more testing next week to see if the list corruption
> still exists, thanks.
>

Sorry, please ignore that... my system hit the corruption again just before I left ....

>
>
>> --
>> Mel Gorman
>> SUSE Labs
>>
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: kernel BUG at mm/vmscan.c:1114
  2011-08-05 12:30                   ` Xiaotian Feng
@ 2011-08-05 12:55                     ` Mel Gorman
  -1 siblings, 0 replies; 27+ messages in thread
From: Mel Gorman @ 2011-08-05 12:55 UTC (permalink / raw)
  To: Xiaotian Feng
  Cc: Andrew Morton, linux-mm, linux-kernel, Pekka Enberg, Christoph Lameter

On Fri, Aug 05, 2011 at 08:30:44PM +0800, Xiaotian Feng wrote:
> On Fri, Aug 5, 2011 at 8:09 PM, Xiaotian Feng <xtfeng@gmail.com> wrote:
> > On Fri, Aug 5, 2011 at 5:19 PM, Mel Gorman <mgorman@suse.de> wrote:
> >> (Adding patch author to cc)
> >>
> >> On Fri, Aug 05, 2011 at 04:42:43PM +0800, Xiaotian Feng wrote:
> >>> On Thu, Aug 4, 2011 at 11:54 AM, Xiaotian Feng <xtfeng@gmail.com> wrote:
> >>> > On Wed, Aug 3, 2011 at 4:54 PM, Mel Gorman <mgorman@suse.de> wrote:
> >>> >> On Wed, Aug 03, 2011 at 02:44:20PM +0800, Xiaotian Feng wrote:
> >>> >>> On Tue, Aug 2, 2011 at 3:22 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> >>> >>> > On Tue, 2 Aug 2011 15:09:57 +0800 Xiaotian Feng <xtfeng@gmail.com> wrote:
> >>> >>> >
> >>> >>> >> __ __I'm hitting the kernel BUG at mm/vmscan.c:1114 twice, each time I
> >>> >>> >> was trying to build my kernel. The photo of crash screen and my config
> >>> >>> >> is attached.
> >>> >>> >
> >>> >>> > hm, now why has that started happening?
> >>> >>> >
> >>> >>> > Perhaps you could apply this debug patch, see if we can narrow it down?
> >>> >>> >
> >>> >>>
> >>> >>> I will try it then, but it isn't very reproducible :(
> >>> >>> But my system hung after some list corruption warnings... I hit the
> >>> >>> corruption 4 times...
> >>> >>>
> >>> >>
> >>> >> That is very unexpected but if lists are being corrupted, it could
> >>> >> explain the previously reported bug as that bug looked like an active
> >>> >> page on an inactive list.
> >>> >>
> >>> >> What was the last working kernel? Can you bisect?
> >>> >>
> >>> >>>  [ 1220.468089] ------------[ cut here ]------------
> >>> >>>  [ 1220.468099] WARNING: at lib/list_debug.c:56 __list_del_entry+0x82/0xd0()
> >>> >>>  [ 1220.468102] Hardware name: 42424XC
> >>> >>>  [ 1220.468104] list_del corruption. next->prev should be
> >>> >>> ffffea0000e069a0, but was ffff880100216c78
> >>> >>>  [ 1220.468106] Modules linked in: ip6table_filter ip6_tables
> >>> >>> ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4
> >>> >>> xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp
> >>> >>> iptable_filter ip_tables x_tables binfmt_misc bridge stp parport_pc
> >>> >>> ppdev snd_hda_codec_conexant snd_hda_intel snd_hda_codec thinkpad_acpi
> >>> >>> snd_hwdep snd_pcm i915 snd_seq_midi snd_rawmidi arc4 cryptd
> >>> >>> snd_seq_midi_event aes_x86_64 snd_seq drm_kms_helper iwlagn snd_timer
> >>> >>> aes_generic drm snd_seq_device mac80211 psmouse uvcvideo videodev snd
> >>> >>> v4l2_compat_ioctl32 soundcore snd_page_alloc serio_raw i2c_algo_bit
> >>> >>> btusb tpm_tis tpm tpm_bios video cfg80211 bluetooth nvram lp joydev
> >>> >>> parport usbhid hid ahci libahci firewire_ohci firewire_core e1000e
> >>> >>> sdhci_pci sdhci crc_itu_t
> >>> >>>  [ 1220.468185] Pid: 1168, comm: Xorg Tainted: G        W   3.0.0+ #23
> >>> >>>  [ 1220.468188] Call Trace:
> >>> >>>  [ 1220.468190]  <IRQ>  [<ffffffff8106db3f>] warn_slowpath_common+0x7f/0xc0
> >>> >>>  [ 1220.468201]  [<ffffffff8106dc36>] warn_slowpath_fmt+0x46/0x50
> >>> >>>  [ 1220.468206]  [<ffffffff81332a52>] __list_del_entry+0x82/0xd0
> >>> >>>  [ 1220.468210]  [<ffffffff81332ab1>] list_del+0x11/0x40
> >>> >>>  [ 1220.468216]  [<ffffffff8117a212>] __slab_free+0x362/0x3d0
> >>> >>>  [ 1220.468222]  [<ffffffff811c6606>] ? bvec_free_bs+0x26/0x40
> >>> >>>  [ 1220.468226]  [<ffffffff8117b767>] ? kmem_cache_free+0x97/0x220
> >>> >>>  [ 1220.468230]  [<ffffffff811c6606>] ? bvec_free_bs+0x26/0x40
> >>> >>>  [ 1220.468234]  [<ffffffff811c6606>] ? bvec_free_bs+0x26/0x40
> >>> >>>  [ 1220.468239]  [<ffffffff8117b8df>] kmem_cache_free+0x20f/0x220
> >>> >>>  [ 1220.468243]  [<ffffffff811c6606>] bvec_free_bs+0x26/0x40
> >>> >>>  [ 1220.468247]  [<ffffffff811c6654>] bio_free+0x34/0x70
> >>> >>>  [ 1220.468250]  [<ffffffff811c66a5>] bio_fs_de
> >>> >>>
> >>> >>
> >>> >
> >>> > I'm hitting this again today, when I'm trying to rebuild my kernel....
> >>> > Looking it a bit
> >>> >
> >>> >  list_del corruption. next->prev should be ffffea0000e069a0, but was
> >>> > ffff880100216c78
> >>> >
> >>> > I find something interesting from my syslog:
> >>> >
> >>> >  PERCPU: Embedded 28 pages/cpu @ffff880100200000 s83456 r8192 d23040 u262144
> >>> >
> >>> >> This warning and the page reclaim warning are on paths that are
> >>> >> commonly used and I would expect to see multiple reports. I wonder
> >>> >> what is happening on your machine that is so unusual.
> >>> >>
> >>> >> Have you run memtest on this machine for a few hours and badblocks
> >>> >> on the disk to ensure this is not hardware trouble?
> >>> >>
> >>> >>> So is it possible that my previous BUG is triggered by slab list corruption?
> >>> >>
> >>> >> Not directly, but clearly there is something very wrong.
> >>> >>
> >>> >> If slub corruption reports are very common and kernel 3.0 is fine, my
> >>> >> strongest candidate for the corruption would be the SLUB lockless
> >>> >> patches. Try
> >>> >>
> >>> >> git diff e4a46182e1bcc2ddacff5a35f6b52398b51f1b11..9e577e8b46ab0c38970c0f0cd7eae62e6dffddee | patch -p1 -R
> >>> >>
> >>> >
> >>>
> >>> Here's a update for the results:
> >>>
> >>> 3.0.0-rc7: running for hours without a crash
> >>> upstream kernel: list corruption happened while building kernel within
> >>> 10 mins (I'm running some app chrome/firefox/thunderbird/... as well)
> >>> upstream kernel with above revert: running for hours without a crash
> >>>
> >>> Trying to bisect but rebuild is slow ....
> >>>
> >>
> >> If you have not done so already, I strongly suggest your bisection
> >> starts within that range of patches to isolate which one is at fault.
> >> It'll cut down on the number of builds you need to do. Thanks for
> >> testing.
> >>
> >
> > This is interesting; I just made the following change:
> >
> > diff --git a/mm/slub.c b/mm/slub.c
> > index eb5a8f9..616b78e 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -2104,8 +2104,9 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> >                        "__slab_alloc"));
> >
> >        if (unlikely(!object)) {
> > -               c->page = NULL;
> > +               //c->page = NULL;
> >                stat(s, DEACTIVATE_BYPASS);
> > +               deactivate_slab(s, c);
> >                goto new_slab;
> >        }
> >
> > With that change my system doesn't print any list corruption warnings
> > and my build succeeds. So this means a revert of 03e404af2 could cure
> > this. I'll do more testing next week to see if the list corruption
> > still exists, thanks.
> >
> 
> Sorry, please ignore that... my system hit the corruption again just before I left ....
> 

Please continue the bisection in that case and establish for sure if the
problem is in that series or not. Thanks.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: kernel BUG at mm/vmscan.c:1114
  2011-08-05 12:55                     ` Mel Gorman
  (?)
@ 2011-08-05 15:51                     ` Christoph Lameter
  -1 siblings, 0 replies; 27+ messages in thread
From: Christoph Lameter @ 2011-08-05 15:51 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Xiaotian Feng, Andrew Morton, linux-mm, linux-kernel, Pekka Enberg

On Fri, 5 Aug 2011, Mel Gorman wrote:

> > > This is interesting; I just made the following change:
> > >
> > > diff --git a/mm/slub.c b/mm/slub.c
> > > index eb5a8f9..616b78e 100644
> > > --- a/mm/slub.c
> > > +++ b/mm/slub.c
> > > @@ -2104,8 +2104,9 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> > >                        "__slab_alloc"));
> > >
> > >        if (unlikely(!object)) {
> > > -               c->page = NULL;
> > > +               //c->page = NULL;
> > >                stat(s, DEACTIVATE_BYPASS);
> > > +               deactivate_slab(s, c);
> > >                goto new_slab;
> > >        }
> > >
> > > With that change my system doesn't print any list corruption warnings
> > > and my build succeeds. So this means a revert of 03e404af2 could cure
> > > this. I'll do more testing next week to see if the list corruption
> > > still exists, thanks.
> > >
> >
> > Sorry, please ignore that... my system hit the corruption again just before I left ....
> >
>
> Please continue the bisection in that case and establish for sure if the
> problem is in that series or not. Thanks.

The above fix should not affect anything, since a per-cpu slab is not
on any partial list. And since there are no objects remaining in the
slab, there is also no point in putting it back. It won't be on any
list before or after the action, so no list processing is needed.

Hmmm.... There may be a race with slab_free from a remote processor. I
don't see any problem here, since we convert the page from frozen to
non-frozen in __slab_alloc, and __slab_free will skip the partial-list
management if it sees the page is frozen.

Maybe we need some memory barriers here. Right now we are relying on
the cmpxchg_double to synchronize the state in the page struct, but we
also need the c->page variable to be consistent with that state. We do
disable interrupts in __slab_alloc, though, so no races are possible
with local slab_free, only with remote __slab_free invocations, which
will not touch c->page.
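
To make that consistency window concrete, here is a minimal stand-in
sketch (hypothetical pseudo-C, not the real mm/slub.c; the fake_* names
are invented for illustration, and "frozen" stands for the bit carried
in the cmpxchg_double state):

/* Hypothetical illustration of the suspected ordering gap; not slub code. */
#include <stdbool.h>
#include <stddef.h>

struct fake_page {
	bool frozen;		/* stands in for the state updated via
				   cmpxchg_double */
};

struct fake_cpu {
	struct fake_page *page;	/* stands in for c->page */
};

/* CPU A: allocation slow path, local interrupts disabled */
static void alloc_side(struct fake_cpu *c)
{
	c->page->frozen = false;	/* in slub this is the cmpxchg_double */
	c->page = NULL;			/* plain store; nothing orders it
					   against the transition above */
}

/* CPU B: freeing an object into the same page from another processor */
static void remote_free_side(struct fake_page *page)
{
	if (!page->frozen) {
		/* partial-list management would run here; if this could
		   execute while CPU A sits between its two stores, two
		   CPUs might reason about the same page's list membership
		   at once, which is one way a list_del corruption could
		   appear */
	}
}

Whether that interleaving is actually reachable is exactly the open
question; the sketch only shows why the c->page store and the frozen
transition have to stay consistent with each other.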




^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2011-08-05 15:51 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CAJn8CcE20-co4xNOD8c+0jMeABrc1mjmGzju3xT34QwHHHFsUA@mail.gmail.com>
     [not found] ` <CAJn8CcG-pNbg88+HLB=tRr26_R+A0RxZEWsJQg4iGe4eY2noXA@mail.gmail.com>
2011-08-02  7:22   ` kernel BUG at mm/vmscan.c:1114 Andrew Morton
2011-08-03  6:44     ` Xiaotian Feng
2011-08-03  8:54       ` Mel Gorman
2011-08-03  9:02         ` Li Zefan
2011-08-04  3:54         ` Xiaotian Feng
2011-08-05  8:42           ` Xiaotian Feng
2011-08-05  9:19             ` Mel Gorman
2011-08-05 12:09               ` Xiaotian Feng
2011-08-05 12:30                 ` Xiaotian Feng
2011-08-05 12:55                   ` Mel Gorman
2011-08-05 15:51                     ` Christoph Lameter
2011-08-02 14:24   ` Mel Gorman
2011-08-02 17:15     ` Andrew Morton
2011-08-03  6:45     ` Xiaotian Feng
