* Please help!  AM35xx mm/slab.c BUG
@ 2012-06-05  6:37 CF Adad
  2012-06-05  7:08 ` Tony Lindgren
  0 siblings, 1 reply; 20+ messages in thread
From: CF Adad @ 2012-06-05  6:37 UTC (permalink / raw)
  To: linux-omap

All,

I'm **really** hoping someone out there can help us with this.

My team has been working with the AM3517 for several months now, and we seem to be plagued every so often by what we have termed the "slab bug".  In short, it looks something like the pasted bootlog below.  This has been an *incredibly* hard bug to figure out.  We have a couple of different AM3517-based platforms at our disposal, but the one we see the issue on almost exclusively is a custom, prototype baseboard designed around the TechNexion TAM3517.  Over the last several months, we have tried several versions of Linux from the linux-omap tree, with loads of different configurations, and even different bootloader versions and combinations.  We've spent most of our time with a linux-omap snapshot that was a 3.2-rc6, and more recently a 3.4-rc6 from a week or two back.  (Tomorrow I anticipate pulling the latest 3.5 now that I see it's out.)  In all cases, since we switched to 3.0+, we've seen these errors.

They are *very* inconsistent in when they occur, but they happen often enough to be very frustrating.  Consequently, our team has had an incredibly difficult time tracking down what's causing them.  They seem to occur at random, perhaps on average once every handful of days.  We've messed with everything we can think of, from tweaking kernel options (like enabling/disabling preemption), to disabling various drivers and userspace components, to reviewing every single line in any of our board files.  We have tried different versions and combinations of the OS and both bootloaders (x-loader & u-boot), and even went so far as to do a full analysis of the RAM timings in the EMIF4.  Unfortunately, nothing so far has worked.  The error occurs when operating off both the SD/MMC and the NAND devices, with or without the Ethernet interfaces (LAN9221 & EMAC) up and/or running, with or without PREEMPT, under heavy load and sometimes just idling, ...  There is simply nothing consistent about it.  After probably 2 weeks without seeing one, I saw 3 today.

Though the error's occurrence is inconsistent, the error itself is consistent.  It always throws an internal Oops at the following section of code in mm/slab.c:
---
/*
 * The slab was either on partial or free list so
 * there must be at least one object available for
 * allocation.
 */
BUG_ON(slabp->inuse >= cachep->num);
---
(It appears this was patched in eons ago: https://lkml.org/lkml/2007/2/19/20.  So it's nothing new.)
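
For anyone not familiar with this check, here is a minimal user-space sketch of the invariant that BUG_ON enforces.  It is not the kernel's code: struct model_slab, struct model_cache and the function names are made up for illustration, and only the two fields the check compares are modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical, simplified stand-ins for the kernel's struct slab and
 * struct kmem_cache; only the fields the check reads are modelled. */
struct model_slab {
    unsigned int inuse;   /* objects currently handed out from this slab */
};

struct model_cache {
    unsigned int num;     /* objects that fit in one slab */
};

/* True when a slab taken from the partial or free list is sane, i.e. it
 * still has at least one unallocated object. */
static bool slab_has_free_object(const struct model_cache *c,
                                 const struct model_slab *s)
{
    return s->inuse < c->num;
}

/* Print the two values the check compares, e.g. for extra logging. */
static void report_slab(const struct model_cache *c,
                        const struct model_slab *s)
{
    printf("slab inuse=%u, cache num=%u -> %s\n", s->inuse, c->num,
           slab_has_free_object(c, s) ? "ok" : "would hit the BUG_ON");
}

/* Take up to 'want' objects from the slab, as a refill loop would.
 * Returns how many objects were actually taken. */
static unsigned int take_objects(const struct model_cache *c,
                                 struct model_slab *s, unsigned int want)
{
    unsigned int taken = 0;

    /* Model of BUG_ON(slabp->inuse >= cachep->num). */
    assert(slab_has_free_object(c, s));

    while (s->inuse < c->num && taken < want) {
        s->inuse++;
        taken++;
    }
    return taken;
}
```

In this model, a cache with num = 44 and a slab at inuse = 42 yields two objects from take_objects() and the slab is then full.  Calling take_objects() on that same slab again trips the assert, which corresponds to the condition the BUG_ON tests for: a slab that is already full yet was still found on the partial or free list.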



Under our Linux 3.2-rc6, the error is listed as line "3109".  Under our Linux 3.4-rc6, it is line "3175".  I'm FAR from a Linux MM expert, and we've done NO customization there.  So, I'm hoping someone on here will see something they recognize, or at least be able to help us get more debugging in there to try to get more information.

Amazingly, Google has been little help on this one.  It seems we're the only folks in the world having this issue.  Can anyone *please* help us sort out what's happening here?  Or even just suggest where to start looking?

Below is a very recent snapshot of our full boot + the crash, this time on poweroff though it can happen any time.  We have probably close to 15 of these crashes on file if anyone could use more information.  I just chose not to spam them all on here.

(*NOTE: These bootloaders & kernel sources are highly customized now.  So, you may notice that the processor detects as an AM3505 instead of a 3517 because we've removed features unnecessary to our project, such as the SGX, from our software.  That being said, these errors have occurred on the very base builds before any customization.  Please look past that unless you see something really unexplainable.  Given that we've seen these errors before any of the customizations, we do not believe they are the culprit.  Similarly, those "ECC ERRORS" you see relate to us having to use Micron's on-die ECC for our 4-bit NAND.  We've not gotten the bootloaders and Linux to play nicely with it yet without using the on-die support.  The "ERRORS" are simply referencing the first block, which is managed with a 1-bit ECC instead of the 4-bit.  So, other non-x-loader processes see that section as an error when reading it.)


Thanks in advance for your help!


Texas Instruments X-Loader 1.51-dirty (May 29 2012 - 14:05:44)
Enable Micron on-die ECC engine
Detected board: TEST_TAM3517+
Starting X-loader on MMC 
Reading boot sector
448240 Bytes Read from MMC
Starting 'u-boot.bin' from MMC...
Starting OS Bootloader...


U-Boot 2012.04.01-00083-g0834508-dirty (Jun 04 2012 - 18:35:28)

AM35XX-GP ES2.0, CPU-OPP2, L3-165MHz, Max CPU Clock 600 Mhz
TEST_TAM3517+ Board + LPDDR/NAND
I2C:   ready
DRAM:  256 MiB
NAND:  NAND device: Manufacturer ID: 0x2c, Chip ID: 0xbc (Micron NAND 512MiB 1,8V 16-bit)
512 MiB
MMC:   OMAP SD/MMC: 0
Scanning device for bad blocks
ERROR: ON-DIE ECC READ FAILURE
ERROR: ON-DIE ECC READ FAILURE
ERROR: ON-DIE ECC READ FAILURE
ERROR: ON-DIE ECC READ FAILURE
ERROR: ON-DIE ECC READ FAILURE
ERROR: ON-DIE ECC READ FAILURE
ERROR: ON-DIE ECC READ FAILURE
ERROR: ON-DIE ECC READ FAILURE
*** Warning - bad CRC, using default environment

In:    serial
Out:   serial
Err:   serial
Die ID #00000000000averybignumber00000000000
Enabling Micron ECC ... OK
Micron On-Die ECC selected
Net:   DaVinci-EMAC, smc911x-0
Hit any key to stop autoboot:  0

reading uImage

2615512 bytes read
Booting from mmc ...
## Booting kernel from Legacy Image at 82000000 ...
   Image Name:   Linux-3.4.0-rc6
   Image Type:   ARM Linux Kernel Image (uncompressed)
   Data Size:    2615448 Bytes = 2.5 MiB
   Load Address: 80008000
   Entry Point:  80008000
   Verifying Checksum ... OK
   Loading Kernel Image ... OK
OK

Starting kernel ...

[    0.000000] Booting Linux on physical CPU 0
[    0.000000] Linux version 3.4.0-rc6 (user@userbox) (gcc version 4.7.1 20120421 (prerelease) (GCC) ) #8 Mon Jun 4 19:37:56 EDT 2012
[    0.000000] CPU: ARMv7 Processor [411fc087] revision 7 (ARMv7), cr=10c5387d
[    0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT nonaliasing instruction cache
[    0.000000] Machine: Technexion TAM3517
[    0.000000] Memory policy: ECC disabled, Data cache writeback
[    0.000000] AM3505 ES1.1 (l2cache neon isp )
[    0.000000] Clocking rate (Crystal/Core/MPU): 26.0/332/600 MHz
[    0.000000] Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 64768
[    0.000000] Kernel command line: console=ttyO2,115200n8 mem=256M mpurate=600 root=/dev/mmcblk0p2 rw rootfstype=ext3 rootwait nohlt0
[    0.000000] PID hash table entries: 1024 (order: 0, 4096 bytes)
[    0.000000] Dentry cache hash table entries: 32768 (order: 5, 131072 bytes)
[    0.000000] Inode-cache hash table entries: 16384 (order: 4, 65536 bytes)
[    0.000000] Memory: 255MB = 255MB total
[    0.000000] Memory: 253160k/253160k available, 8984k reserved, 0K highmem
[    0.000000] Virtual kernel memory layout:
[    0.000000]     vector  : 0xffff0000 - 0xffff1000   (   4 kB)
[    0.000000]     fixmap  : 0xfff00000 - 0xfffe0000   ( 896 kB)
[    0.000000]     vmalloc : 0xd0800000 - 0xff000000   ( 744 MB)
[    0.000000]     lowmem  : 0xc0000000 - 0xd0000000   ( 256 MB)
[    0.000000]     pkmap   : 0xbfe00000 - 0xc0000000   (   2 MB)
[    0.000000]     modules : 0xbf000000 - 0xbfe00000   (  14 MB)
[    0.000000]       .text : 0xc0008000 - 0xc04af000   (4764 kB)
[    0.000000]       .init : 0xc04af000 - 0xc04eb000   ( 240 kB)
[    0.000000]       .data : 0xc04ec000 - 0xc05330a8   ( 285 kB)
[    0.000000]        .bss : 0xc05330e8 - 0xc058734c   ( 337 kB)
[    0.000000] NR_IRQS:460
[    0.000000] IRQ: Found an INTC at 0xfa200000 (revision 4.0) with 96 interrupts
[    0.000000] Total of 96 interrupts on 1 active controller
[    0.000000] OMAP clockevent source: GPTIMER1 at 32768 Hz
[    0.000000] sched_clock: 32 bits at 32kHz, resolution 30517ns, wraps every 131071999ms
[    0.000000] OMAP clocksource: 32k_counter at 32768 Hz
[    0.000000] Console: colour dummy device 80x30
[    0.000274] Calibrating delay loop... 597.64 BogoMIPS (lpj=2334720)
[    0.039093] pid_max: default: 32768 minimum: 301
[    0.039306] Mount-cache hash table entries: 512
[    0.040161] CPU: Testing write buffer coherency: ok
[    0.040252] ftrace: allocating 13705 entries in 41 pages
[    0.099639] Setting up static identity map for 0x80380d18 - 0x80380d70
[    0.100982] devtmpfs: initialized
[    0.106201] dummy: 
[    0.106689] NET: Registered protocol family 16
[    0.107238] GPMC revision 5.0
[    0.107299] gpmc: irq-20 could not claim: err -22
[    0.110870] gpiochip_add: registered GPIOs 0 to 31 on device: gpio
[    0.110931] OMAP GPIO hardware version 2.5
[    0.111358] gpiochip_add: registered GPIOs 32 to 63 on device: gpio
[    0.111785] gpiochip_add: registered GPIOs 64 to 95 on device: gpio
[    0.112213] gpiochip_add: registered GPIOs 96 to 127 on device: gpio
[    0.112609] gpiochip_add: registered GPIOs 128 to 159 on device: gpio
[    0.113006] gpiochip_add: registered GPIOs 160 to 191 on device: gpio
[    0.114715] omap_mux_init: Add partition: #1: core, flags: 0
[    0.234985] TEST control/display GPIOs initialized
[    0.240295]  omap-mcbsp.2: alias fck already exists
[    0.240692]  omap-mcbsp.3: alias fck already exists
[    0.242004] Switched to new clocking rate (Crystal/Core/MPU): 26.0/332/600 MHz
[    0.242279] OMAP DMA hardware revision 4.0
[    0.255645] bio: create slab <bio-0> at 0
[    0.256591] fixed-dummy: 
[    0.258636] SCSI subsystem initialized
[    0.273529] omap_i2c omap_i2c.1: bus 1 rev1.3.12 at 400 kHz
[    0.289123] omap_i2c omap_i2c.2: bus 2 rev1.3.12 at 400 kHz
[    0.289581] omap_i2c omap_i2c.3: bus 3 rev1.3.12 at 400 kHz
[    0.292022] Advanced Linux Sound Architecture Driver Version 1.0.25.
[    0.293334] Switching to clocksource 32k_counter
[    0.333160] NET: Registered protocol family 2
[    0.333526] IP route cache hash table entries: 2048 (order: 1, 8192 bytes)
[    0.334197] TCP established hash table entries: 8192 (order: 4, 65536 bytes)
[    0.334381] TCP bind hash table entries: 8192 (order: 3, 32768 bytes)
[    0.334503] TCP: Hash tables configured (established 8192 bind 8192)
[    0.334533] TCP: reno registered
[    0.334533] UDP hash table entries: 256 (order: 0, 4096 bytes)
[    0.334594] UDP-Lite hash table entries: 256 (order: 0, 4096 bytes)
[    0.334869] NET: Registered protocol family 1
[    0.339691] jffs2: version 2.2. (NAND) (SUMMARY)  © 2001-2006 Red Hat, Inc.
[    0.340728] msgmni has been set to 494
[    0.341979] io scheduler noop registered
[    0.342071] io scheduler cfq registered (default)
[    0.342529] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
[    0.345031] omap_uart.0: ttyO0 at MMIO 0x4806a000 (irq = 72) is a OMAP UART0
[    0.345672] omap_uart.1: ttyO1 at MMIO 0x4806c000 (irq = 73) is a OMAP UART1
[    0.346252] omap_uart.2: ttyO2 at MMIO 0x49020000 (irq = 74) is a OMAP UART2
[    0.830505] console [ttyO2] enabled
[    0.834869] omap_uart.3: ttyO3 at MMIO 0x4809e000 (irq = 84) is a OMAP UART3
[    0.843170] at24 1-0050: 256 byte 24c02 EEPROM, writable, 1 bytes/write
[    0.852142] GPIO NAND driver, © 2004 Simtec Electronics
[    0.858673] NAND device: Manufacturer ID: 0x2c, Chip ID: 0xbc (Micron NAND 512MiB 1,8V 16-bit)
[    0.867858] NAND_ECC_NONE selected by board driver. This is not recommended!
[    0.875549] Creating 7 MTD partitions on "omap2-nand.0":
[    0.881195] 0x000000000000-0x000000080000 : "X-Loader"
[    0.886657] WARNING :: READ Operation ECC Error: 0xe1 delay 27 usec
[    0.893371] WARNING :: READ Operation ECC Error: 0xe1 delay 28 usec
[    0.900054] WARNING :: READ Operation ECC Error: 0xe1 delay 28 usec
[    0.906768] WARNING :: READ Operation ECC Error: 0xe1 delay 28 usec
[    0.913482] WARNING :: READ Operation ECC Error: 0xe1 delay 28 usec
[    0.920196] WARNING :: READ Operation ECC Error: 0xe1 delay 28 usec
[    0.926879] WARNING :: READ Operation ECC Error: 0xe1 delay 28 usec
[    0.933563] WARNING :: READ Operation ECC Error: 0xe1 delay 28 usec
[    0.942169] 0x000000080000-0x000000240000 : "U-Boot"
[    0.951812] 0x000000240000-0x000000280000 : "U-Boot Env"
[    0.959503] 0x000000280000-0x000000780000 : "Kernel"
[    0.973937] 0x000000780000-0x00000cf80000 : "Root Filesystem (read-only)"
[    1.279815] 0x00000cf80000-0x00001fb80000 : "Root Filesystem (writeable overlay)"
[    1.734985] 0x00001fb80000-0x000020000000 : "DATA"
[    1.749572] smsc911x: Driver version 2008-10-21
[    1.756835] smsc911x-mdio: probed
[    1.760406] smsc911x smsc911x.0: eth0: attached PHY driver [Generic PHY] (mii_bus:phy_addr=smsc911x-0:01, irq=-1)
[    1.802093] smsc911x smsc911x.0: eth0: MAC Address: xx:xx:xx:xx:xx:xx
[    1.856048] davinci_mdio davinci_mdio.0: davinci mdio revision 1.5
[    1.862548] davinci_mdio davinci_mdio.0: detected phy mask fffffffe
[    1.869781] davinci_mdio.0: probed
[    1.873382] davinci_mdio davinci_mdio.0: phy[0]: device davinci_mdio-0:00, driver unknown
[    1.882263] i2c /dev entries driver
[    1.888000] omap_wdt: OMAP Watchdog Timer Rev 0x31: initial timeout 60 sec
[    1.904510] soc-audio soc-audio: ASoC machine TEST should use snd_soc_register_card()
[    1.914825] asoc: TEST-codec-dai <-> omap-mcbsp.2 mapping ok
[    1.922912] asoc: TEST-codec-dai <-> omap-mcbsp.3 mapping ok
[    1.930725] asoc: TEST-codec-dai <-> omap-mcbsp.4 mapping ok
[    1.940155] TEST SoC init
[    1.944519] TCP: cubic registered
[    1.948028] Initializing XFRM netlink socket
[    1.953613] NET: Registered protocol family 17
[    1.960388] NET: Registered protocol family 15
[    1.965179] VFP support v0.3: implementor 41 architecture 3 part 30 variant c rev 1
[    1.973266] ThumbEE CPU extension supported.
[    1.977813] Registering SWP/SWPB emulation handler
[    1.984588] voltdm_scale: No voltage scale API registered for vdd_mpu_iva
[    1.991851] voltdm_scale: No voltage scale API registered for vdd_core
[    1.998809] PM: no software I/O chain control; some wakeups may be lost
[    2.014526] clock: disabling unused clocks to save power
[    2.024963] ALSA device list:
[    2.028167]   #0: TEST
[    2.031402] Waiting for root device /dev/mmcblk0p2...
[    2.053710] mmc0: host does not support reading read-only switch. assuming write-enable.
[    2.063995] mmc0: new high speed SDHC card at address 0001
[    2.070587] mmcblk0: mmc0:0001 00000 7.41 GiB 
[    2.077636]  mmcblk0: p1 p2 p3
[    2.149963] kjournald starting.  Commit interval 5 seconds
[    2.155822] EXT3-fs (mmcblk0p2): warning: maximal mount count reached, running e2fsck is recommended
[    2.171234] EXT3-fs (mmcblk0p2): using internal journal
[    2.176757] EXT3-fs (mmcblk0p2): mounted filesystem with ordered data mode
[    2.184051] VFS: Mounted root (ext3 filesystem) on device 179:2.
[    2.192565] devtmpfs: mounted
[    2.196289] Freeing init memory: 240K

Welcome to The Ångström Distribution!

Starting udev Coldplug all Devices...                                          
Starting Media Directory...                                                    
Starting Remount API VFS...                                                    
Starting Runtime Directory...                                                  
Starting Lock Directory...                                                     
Started Huge Pages File System                                         [  OK  ]
Starting File System Check on Root Device...                                   
Starting Debug File System...                                                  
Started Security File System                                           [  OK  ]
Started Load Kernel Modules                                            [  OK  ]
Started FUSE Control File System                                       [  OK  ]
Started Configuration File System                                      [  OK  ]
Starting Apply Kernel Variables...                                             
Starting udev Kernel Device Manager...                                         
Starting Journal Service...                                                    
Started Journal Service                                                [  OK  ]
[    3.141510] udevd[505]: starting version 181
Starting POSIX Message Queue File System...                                    
Started Set Up Additional Binary Formats                               [  OK  ]
Started udev Kernel Device Manager                                     [  OK  ]
Started Media Directory                                                [  OK  ]
Started Remount API VFS                                                [  OK  ]
Started Runtime Directory                                              [  OK  ]
Started Lock Directory                                                 [  OK  ]
Started File System Check on Root Device                               [  OK  ]
Started Debug File System                                              [  OK  ]
Started Apply Kernel Variables                                         [  OK  ]
Started POSIX Message Queue File System                                [  OK  ]
Starting Remount Root FS...                                                    
Started udev Coldplug all Devices                                      [  OK  ]
Started Remount Root FS                                                [  OK  ]
Started Machine ID first boot configure                                [  OK  ]
Starting /tmp...                                                               
Started Opkg first boot configure                                      [  OK  ]
Started /tmp                                                           [  OK  ]
Starting Recreate Volatile Files and Directories...                            
Starting Load Random Seed...                                                   
Started Load Random Seed                                               [  OK  ]
Started Recreate Volatile Files and Directories                        [  OK  ]
Starting Timestamping service...                                               
Started Timestamping service                                           [  OK  ]
Started SSH Key Generation                                             [  OK  ]
Starting Permit User Sessions...                                               
Starting Login Service...                                                      
Starting Avahi mDNS/DNS-SD Stack...                                            
Starting LSB: ifplugd daemon...                                                
Starting D-Bus System Message Bus...                                           
Started Permit User Sessions                                           [  OK  ]
Starting Getty on tty1...                                                      
Started Getty on tty1                                                  [  OK  ]
Starting Serial Getty on ttyO2...                                              
Started Serial Getty on ttyO2                                          [  OK  ]
Started D-Bus System Message Bus                                       [  OK  ]
[    8.707733] smsc911x smsc911x.0: eth0: SMSC911x/921x identified at 0xd086c000, IRQ: 161
Started Avahi mDNS/DNS-SD Stack                                        [  OK  ]
Started Login Service                                                  [  OK  ]

.---O---.                                           
|       |                  .-.           o o        
|   |   |-----.-----.-----.| |   .----..-----.-----.
|       |     | __  |  ---'| '--.|  .-'|     |     |
|   |   |  |  |     |---  ||  --'|  |  |  '  | | | |
'---'---'--'--'--.  |-----''----''--'  '-----'-'-'-'
                -'  |
                '---'

The Angstrom Distribution TEST-tam3517 ttyO2

Angstrom v2012.05-core - Kernel 3.4.0-rc6

TEST-tam3517 login: Started LSB: ifplugd daemon                                            [  OK  ]
ifplugd[864]: Starting Network Interface Plugging Daemon: eth0.

.---O---.                                           
|       |                  .-.           o o        
|   |   |-----.-----.-----.| |   .----..-----.-----.
|       |     | __  |  ---'| '--.|  .-'|     |     |
|   |   |  |  |     |---  ||  --'|  |  |  '  | | | |
'---'---'--'--'--.  |-----''----''--'  '-----'-'-'-'
                -'  |
                '---'

The Angstrom Distribution TEST-tam3517 ttyO2

Angstrom v2012.05-core - Kernel 3.4.0-rc6

TEST-tam3517 login: root
Last login: Thu Jan  1 03:51:39 UTC 1970 on ttyO2
root@sc_hd1u-tam3517:~# poweroff
Stopping Timestamping service...                                               
Stopping Getty on ttStopping Login Service...                                                      
Stopping Avahi mDNS/DNS-SD Stack...                                            
Stopping D-Bus System Message Bus...                                           
Stopping LSB: ifplugd daemon...                                                
[  188.316284] ------------[ cut here ]------------
[  188.321197] kernel BUG at mm/slab.c:3175!
[  188.325408] Internal error: Oops - BUG: 0 [#1] ARM
[  188.330444] Modules linked in:
[  188.333648] CPU: 0    Not tainted  (3.4.0-rc6 #8)
[  188.338623] PC is at cache_alloc_refill+0x13c/0x534
[  188.343749] LR is at kmem_cache_alloc+0x10c/0x11c
[  188.348693] pc : [<c037cefc>]    lr : [<c00c3e84>]    psr: 60000093
[  188.348724] sp : cee53e48  ip : 0000002d  fp : cee53e8c
[  188.360778] r10: 0000002c  r9 : cf80dec0  r8 : 00000020
[  188.366271] r7 : 00100100  r6 : cf802bc0  r5 : 00200200  r4 : cf80c200
[  188.373107] r3 : cfbc4000  r2 : 00000012  r1 : 0000002c  r0 : 0000002c
[  188.379974] Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment user
[  188.387573] Control: 10c5387d  Table: 8ee60019  DAC: 00000015
[  188.393615] Process systemd-cgroups (pid: 932, stack limit = 0xcee522f0)
[  188.400665] Stack: (0xcee53e48 to 0xcee54000)
[  188.405242] 3e40:                   cf802bd0 000000d0 000000d0 00000000 000000d0 cf802bc8
[  188.413848] 3e60: cee53ebc 000000d0 60000013 b6f78000 000000d0 c00b89f8 08100071 cf80dec0
[  188.422424] 3e80: cee53ebc cee53e90 c00c3e84 c037cdcc 00000017 cecb37b0 0000001b cec70968
[  188.431030] 3ea0: b6f78000 c0533f10 cecc8200 00000000 cee53eec cee53ec0 c00b89f8 c00c3d84
[  188.439636] 3ec0: cecb37b0 00000200 0000001b 00000000 cee53f7c cfb89820 b6f78000 00000014
[  188.448242] 3ee0: cee53f04 cee53ef0 c00ba4f8 c00b89cc 00000000 cec70968 cee53f6c cee53f08
[  188.456848] 3f00: c00bb15c c00ba4d8 08100071 cfb89820 cfbb8cc0 00000014 00000000 cecc8200
[  188.465423] 3f20: 00000001 08100073 cecc8200 60000013 cecc8234 00000001 b6f77000 00000000
[  188.474029] 3f40: c0380ccc b6f78000 00000001 b6f78000 cee52000 00000001 cee52000 00000000
[  188.482635] 3f60: cee53fa4 cee53f70 c00bb2bc c00bae24 08100071 c0050348 00000022 cec70968
[  188.491241] 3f80: ffffffff b6fddb98 00000000 b6f5e930 0000007d c000e9c4 00000000 cee53fa8
[  188.499847] 3fa0: c000e780 c00bb194 b6fddb98 00000000 b6f77000 00001000 00000001 fffff000
[  188.508453] 3fc0: b6fddb98 00000000 b6f5e930 0000007d b6f5eb80 b6f5bce0 b6f5b000 becb3afc
[  188.517028] 3fe0: 00019d60 becb3a40 b6fc4fb8 b6fcff4c 80000010 b6f77000 00000000 00000000
[  188.525634] Backtrace: 
[  188.528228] [<c037cdc0>] (cache_alloc_refill+0x0/0x534) from [<c00c3e84>] (kmem_cache_alloc+0x10c/0x11c)
[  188.538208] [<c00c3d78>] (kmem_cache_alloc+0x0/0x11c) from [<c00b89f8>] (__split_vma+0x38/0x1bc)
[  188.547454] [<c00b89c0>] (__split_vma+0x0/0x1bc) from [<c00ba4f8>] (split_vma+0x2c/0x3c)
[  188.555969] [<c00ba4cc>] (split_vma+0x0/0x3c) from [<c00bb15c>] (mprotect_fixup+0x344/0x370)
[  188.564819]  r4:cec70968 r3:00000000
[  188.568603] [<c00bae18>] (mprotect_fixup+0x0/0x370) from [<c00bb2bc>] (sys_mprotect+0x134/0x1c4)
[  188.577850] [<c00bb188>] (sys_mprotect+0x0/0x1c4) from [<c000e780>] (ret_fast_syscall+0x0/0x30)
[  188.587005]  r8:c000e9c4 r7:0000007d r6:b6f5e930 r5:00000000 r4:b6fddb98
[  188.594055] Code: e5930010 e5991018 e1500001 3a00000e (e7f001f2) 
[  188.600616] ---[ end trace af8909ba8c2e9ac6 ]---
[  188.606811] ------------[ cut here ]------------
[  188.611663] kernel BUG at mm/slab.c:3175!
[  188.615875] Internal error: Oops - BUG: 0 [#2] ARM
[  188.620910] Modules linked in:
[  188.624145] CPU: 0    Tainted: G      D       (3.4.0-rc6 #8)
[  188.630096] PC is at cache_alloc_refill+0x13c/0x534
[  188.635223] LR is at kmem_cache_alloc+0x10c/0x11c
[  188.640167] pc : [<c037cefc>]    lr : [<c00c3e84>]    psr: 60000093
[  188.640167] sp : cf82de30  ip : 000000d0  fp : cf82de74
[  188.652252] r10: 00100100  r9 : cf80dec0  r8 : c002d144
[  188.657745] r7 : 00100100  r6 : cf802bc0  r5 : 00200200  r4 : cf80c200
[  188.664581] r3 : cfbc4000  r2 : 0000003c  r1 : 0000002c  r0 : 0000002c
[  188.671447] Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment user
[  188.679046] Control: 10c5387d  Table: 8fb94019  DAC: 00000015
[  188.685089] Process systemd (pid: 1, stack limit = 0xcf82c2f0)
[  188.691192] Stack: (0xcf82de30 to 0xcf82e000)
[  188.695800] de20:                                     cf802bd0 000000d0 000000d0 00000000
[  188.704376] de40: 000000d0 cf802bc8 cf82dea4 000000d0 60000013 cec376c0 000000d0 c002d144
[  188.712982] de60: cfbc4814 cf80dec0 cf82dea4 cf82de78 c00c3e84 c037cdcc cf5281e4 cfbc4808
[  188.721588] de80: cfbc4808 cfb8c288 cec376c0 00000000 cfbc4828 cfbc4824 cf82deec cf82dea8
[  188.730194] dea0: c002d144 c00c3d84 ced35c00 cf5281c0 cec376f4 cfb8de34 cfb8de00 cfbc4808
[  188.738800] dec0: cf8269b8 01200011 cf82c000 cec54b80 00000000 c0533f00 cfa1e298 cfa1e100
[  188.747375] dee0: cf82df44 cf82def0 c002db08 c002d008 00000600 00000002 00000000 cfa1e1e8
[  188.755981] df00: 00000000 cf82dfb0 bebdf6d8 00000000 c00d0f58 fffffff4 00000000 01200011
[  188.764587] df20: 00000000 bebdf6d8 00000000 cf82dfb0 cf82c000 00000000 cf82df8c cf82df48
[  188.773193] df40: c002e248 c002d450 b6fc4068 00000000 00000000 000c33f8 000001ed 00000027
[  188.781799] df60: cf82df94 cf82df70 b6fc4068 bebdf6d8 00000001 00000078 c000e9c4 00000000
[  188.790374] df80: cf82dfa4 cf82df90 c0011aa0 c002e134 00000000 b6fc4068 00000000 cf82dfa8
[  188.798980] dfa0: c000e780 c0011a70 b6fc4068 cf82dfb0 01200011 00000000 00000000 00000000
[  188.807586] dfc0: b6fc4068 bebdf6d8 00000001 00000078 b6ee7000 b6fc4000 000c3168 bebdf71c
[  188.816192] dfe0: b6fc44c0 bebdf6d8 00000001 b6e52f88 60000010 01200011 1b631f25 06fb7edf
[  188.824768] Backtrace: 
[  188.827362] [<c037cdc0>] (cache_alloc_refill+0x0/0x534) from [<c00c3e84>] (kmem_cache_alloc+0x10c/0x11c)
[  188.837341] [<c00c3d78>] (kmem_cache_alloc+0x0/0x11c) from [<c002d144>] (dup_mm+0x148/0x3f8)
[  188.846221] [<c002cffc>] (dup_mm+0x0/0x3f8) from [<c002db08>] (copy_process.part.52+0x6c4/0xca8)
[  188.855468] [<c002d444>] (copy_process.part.52+0x0/0xca8) from [<c002e248>] (do_fork+0x120/0x318)
[  188.864807] [<c002e128>] (do_fork+0x0/0x318) from [<c0011aa0>] (sys_clone+0x3c/0x44)
[  188.872955] [<c0011a64>] (sys_clone+0x0/0x44) from [<c000e780>] (ret_fast_syscall+0x0/0x30)
[  188.881744] Code: e5930010 e5991018 e1500001 3a00000e (e7f001f2) 
[  188.888244] ---[ end trace af8909ba8c2e9ac7 ]---
[  188.925689] ------------[ cut here ]------------
[  188.930603] kernel BUG at mm/slab.c:3175!
[  188.934814] Internal error: Oops - BUG: 0 [#3] ARM
[  188.939849] Modules linked in:
[  188.943084] CPU: 0    Tainted: G      D       (3.4.0-rc6 #8)
[  188.949066] PC is at cache_alloc_refill+0x13c/0x534
[  188.954193] LR is at kmem_cache_alloc+0x10c/0x11c
[  188.959167] pc : [<c037cefc>]    lr : [<c00c3e84>]    psr: 60000093
[  188.959167] sp : cedc9e30  ip : 000000d0  fp : cedc9e74
[  188.971221] r10: 00100100  r9 : cf80dec0  r8 : c002d144
[  188.976715] r7 : 00100100  r6 : cf802bc0  r5 : 00200200  r4 : cf80c200
[  188.983581] r3 : cfbc4000  r2 : 0000003c  r1 : 0000002c  r0 : 0000002c
[  188.990478] Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment user
[  188.998077] Control: 10c5387d  Table: 8ed40019  DAC: 00000015
[  189.004119] Process load-timestamp. (pid: 936, stack limit = 0xcedc82f0)
[  189.011169] Stack: (0xcedc9e30 to 0xcedca000)
[  189.015747] 9e20:                                     cf802bd0 000000d0 000000d0 00000000
[  189.024353] 9e40: 000000d0 cf802bc8 cedc9ea4 000000d0 60000013 ced3e3c0 000000d0 c002d144
[  189.032958] 9e60: ced629cc cf80dec0 cedc9ea4 cedc9e78 c00c3e84 c037cdcc 00000001 ced629e0
[  189.041564] 9e80: ced629c0 cfb9b808 ced3e3c0 00000001 ced629e0 ced629dc cedc9eec cedc9ea8
[  189.050170] 9ea0: c002d144 c00c3d84 cecc10f0 cf528f40 ced3e3f4 ced3e574 ced3e540 ced629c0
[  189.058807] 9ec0: cf826a78 01200011 cedc8000 cfb65040 00000000 c0533f00 ced4a218 ced4a080
[  189.067413] 9ee0: cedc9f44 cedc9ef0 c002db08 c002d008 cedc9fac c003cc34 cedc9f1c ced4a168
[  189.076019] 9f00: 00000000 cedc9fb0 be89d2f0 00000000 cedc9f3c fffffff4 c003db70 01200011
[  189.084625] 9f20: 00000000 be89d2f0 00000000 cedc9fb0 cedc8000 00000000 cedc9f8c cedc9f48
[  189.093231] 9f40: c002e248 c002d450 b6fe8398 00000000 00000000 be89d31c 00000000 be89d31c
[  189.101837] 9f60: 00000008 00000000 b6fe8398 00000000 000003a8 00000078 c000e9c4 00000000
[  189.110443] 9f80: cedc9fa4 cedc9f90 c0011aa0 c002e134 00000000 b6fe8398 00000000 cedc9fa8
[  189.119049] 9fa0: c000e780 c0011a70 b6fe8398 cedc9fb0 01200011 00000000 00000000 00000000
[  189.127655] 9fc0: b6fe8398 00000000 000003a8 00000078 b6f95000 b6fe8330 000e2088 be89d314
[  189.136291] 9fe0: b6fe87f0 be89d2f0 000003a8 b6f00f88 60000010 01200011 00000000 00000000
[  189.144866] Backtrace: 
[  189.147460] [<c037cdc0>] (cache_alloc_refill+0x0/0x534) from [<c00c3e84>] (kmem_cache_alloc+0x10c/0x11c)
[  189.157470] [<c00c3d78>] (kmem_cache_alloc+0x0/0x11c) from [<c002d144>] (dup_mm+0x148/0x3f8)
[  189.166351] [<c002cffc>] (dup_mm+0x0/0x3f8) from [<c002db08>] (copy_process.part.52+0x6c4/0xca8)
[  189.175598] [<c002d444>] (copy_process.part.52+0x0/0xca8) from [<c002e248>] (do_fork+0x120/0x318)
[  189.184967] [<c002e128>] (do_fork+0x0/0x318) from [<c0011aa0>] (sys_clone+0x3c/0x44)
[  189.193115] [<c0011a64>] (sys_clone+0x0/0x44) from [<c000e780>] (ret_fast_syscall+0x0/0x30)
[  189.201904] Code: e5930010 e5991018 e1500001 3a00000e (e7f001f2) 
[  189.208404] ---[ end trace af8909ba8c2e9ac8 ]---
[  189.217132] ------------[ cut here ]------------
[  189.222015] kernel BUG at mm/slab.c:3175!
[  189.226226] Internal error: Oops - BUG: 0 [#4] ARM
[  189.231262] Modules linked in:
[  189.234466] CPU: 0    Tainted: G      D       (3.4.0-rc6 #8)
[  189.240447] PC is at cache_alloc_refill+0x13c/0x534
[  189.245574] LR is at kmem_cache_alloc+0x10c/0x11c
[  189.250518] pc : [<c037cefc>]    lr : [<c00c3e84>]    psr: 60000093
[  189.250549] sp : cee41e98  ip : 000000d0  fp : cee41edc
[  189.262603] r10: 00100100  r9 : cf80dec0  r8 : c00ccd30
[  189.268096] r7 : 00100100  r6 : cf802bc0  r5 : 00200200  r4 : cf80c200
[  189.274932] r3 : cfbc4000  r2 : 0000003c  r1 : 0000002c  r0 : 0000002c
[  189.281799] Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment user
[  189.289398] Control: 10c5387d  Table: 8ed68019  DAC: 00000015
[  189.295440] Process load-timestamp. (pid: 937, stack limit = 0xcee402f0)
[  189.302490] Stack: (0xcee41e98 to 0xcee42000)
[  189.307067] 1e80:                                                       cf802bd0 000000d0
[  189.315673] 1ea0: 000000d0 00000000 000080d0 cf802bc8 000e0a08 000080d0 60000013 ced3e840
[  189.324249] 1ec0: 000080d0 c00ccd30 ced3e840 cf80dec0 cee41f0c cee41ee0 c00c3e84 c037cdcc
[  189.332855] 1ee0: ced3e840 cec42b80 cec6c9c0 cec42b80 ced3e840 cee40000 000e0a08 c0533f10
[  189.341461] 1f00: cee41f44 cee41f10 c00ccd30 c00c3d84 cee41f44 cee41f20 c00cbfb0 cec6c9c0
[  189.350067] 1f20: cee40000 cee13000 000e2c48 000e0a08 cec42b80 cee41fb0 cee41f84 cee41f48
[  189.358673] 1f40: c00cd120 c00cccdc c00d0f58 00000001 00000000 cec102a8 000e0a08 cee13000
[  189.367248] 1f60: cee41fb0 000e0a08 000e2c48 c000e9c4 cee40000 00000000 cee41fa4 cee41f88
[  189.375854] 1f80: c0011b28 c00ccfc4 000e21e8 000e2c48 000e0a08 0000000b 00000000 cee41fa8
[  189.384460] 1fa0: c000e780 c0011af0 000e21e8 000e2c48 000e21e8 000e2c48 000e0a08 000d3afc
[  189.393066] 1fc0: 000e21e8 000e2c48 000e0a08 0000000b 00000000 000d9248 00000000 000d6e10
[  189.401672] 1fe0: b6f0126c be89d3cc 00033240 b6f01278 60000010 000e21e8 00000000 00000000
[  189.410247] Backtrace: 
[  189.412841] [<c037cdc0>] (cache_alloc_refill+0x0/0x534) from [<c00c3e84>] (kmem_cache_alloc+0x10c/0x11c)
[  189.422821] [<c00c3d78>] (kmem_cache_alloc+0x0/0x11c) from [<c00ccd30>] (bprm_mm_init+0x60/0x198)
[  189.432159] [<c00cccd0>] (bprm_mm_init+0x0/0x198) from [<c00cd120>] (do_execve+0x168/0x330)
[  189.440948] [<c00ccfb8>] (do_execve+0x0/0x330) from [<c0011b28>] (sys_execve+0x44/0x64)
[  189.449371] [<c0011ae4>] (sys_execve+0x0/0x64) from [<c000e780>] (ret_fast_syscall+0x0/0x30)
[  189.458221]  r7:0000000b r6:000e0a08 r5:000e2c48 r4:000e21e8
[  189.464202] Code: e5930010 e5991018 e1500001 3a00000e (e7f001f2) 
[  189.470703] ---[ end trace af8909ba8c2e9ac9 ]---

--
To unsubscribe from this list: send the line "unsubscribe linux-omap" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Please help!  AM35xx mm/slab.c BUG
  2012-06-05  6:37 Please help! AM35xx mm/slab.c BUG CF Adad
@ 2012-06-05  7:08 ` Tony Lindgren
  2012-06-05 16:29   ` CF Adad
  0 siblings, 1 reply; 20+ messages in thread
From: Tony Lindgren @ 2012-06-05  7:08 UTC (permalink / raw)
  To: CF Adad; +Cc: linux-omap

* CF Adad <cfadad@rocketmail.com> [120604 23:47]:
> All,
> 
> I'm **really** hoping someone out there can help us with this.
> 
> My team has been working with the AM3517 for several months now, and we seem to be plagued every so often by what we have termed the "slab bug".  In short, it looks something like the pasted bootlog below.  This has been an *incredibly* hard bug to figure out.  We have a couple of different AM3517-based platforms at our disposal, but the one we see the issue on almost exclusively is a custom, prototype baseboard designed around the TechNexion TAM3157.  Over the last several months, we have tried several versions of the Linux off the linux-omap tree, with loads of different configurations, and even different bootloader versions and combinations.  We've spent most of our time with a linux-omap snapshot that was a 3.2-rc6, and more recently a 3.4-rc6 from late a week or two back.  (Tomorrow I anticipate pulling the latest 3.5 now that I see it's out.)  In all cases, since we switched to 3.0+, we've seen these errors.
> 
> They are *very* inconsistent in when they occur, but they happen often enough to be very frustrating.  Consequently, our team has had an incredibly difficult time tracking what's causing them.  They seem to occur at random, perhaps on average once every handful of days.  We've messed with everything we can think of from tweaking kernel options (like enabling/disabling preemption), to disabling various drivers and userspace components, to reviewing every single line in any of our board files.  We have tried different versions and combinations of the OS and both bootloaders (x-loader & u-boot), and even went so far as to do a full analysis of the RAM timings in the EMIF4.  Unfortunately, nothing so far has worked.  The error occurs when operating off both the SD/MMC and the NAND devices, with or without the Ethernets (LAN9221 & EMAC) up and/or running, with or without PREEMPT, under heavy load and sometimes just idling, ...  There is simply nothing
>  consistent about it.  After probably 2 weeks without seeing one, I saw 3 today.
> 
> Though the error's occurence is inconistent, the error itself is.  It always throws an internal OOPs at the following section of code in mm/slab.c:
> ---
> /*
> * The slab was either on partial or free list so
> * there must be at least one object available for
> * allocation.
> */
> BUG_ON(slabp->inuse >= cachep->num);
> ---
> (It appears this was patched in eons ago: https://lkml.org/lkml/2007/2/19/20.  So it's nothing new.)

I can think of at least three issues causing errors like this:

1. Missing retention/off idle workarounds

   You can test this one by booting with nohlt cmdline option and
   seeing if that helps.

2. Broken memory

   I've seen at least one case of this where things would work
   fine if only half of the memory was in use and devices would
   oops at random point within a week. To test for this you can
   pass cmdline options to artificially partition the memory and
   leave out some chunks to see if that helps. Or boot with
   mem=xxxM set to half of the physical memory. And run your tests
   with SLAB_DEBUG set.

3. Software bugs

   My experience is that things are behaving very reliably regarding
   cache and highmem, so I would check #1 and #2 first.
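
When running tests #1 and #2, it helps to confirm that the board really booted with the intended options. The following is a minimal sketch, not something taken from this thread: the helper names `parse_size` and `check_cmdline` and the sample command line are hypothetical, and the 256MB figure is the RAM size mentioned elsewhere in this thread. It checks a kernel command line for `nohlt` and for a `mem=` value equal to half of physical memory.

```python
def parse_size(text):
    """Convert a kernel-style size such as '128M' to bytes (K/M/G suffixes)."""
    units = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}
    suffix = text[-1].upper()
    if suffix in units:
        return int(text[:-1]) * units[suffix]
    return int(text)


def check_cmdline(cmdline, physical_bytes):
    """Report whether 'nohlt' is present and whether mem= is half of RAM."""
    opts = cmdline.split()
    # Collect every mem= value; the last one on the line is the one we report.
    mem = [o.split("=", 1)[1] for o in opts if o.startswith("mem=")]
    mem_bytes = parse_size(mem[-1]) if mem else None
    return {
        "nohlt": "nohlt" in opts,
        "mem_bytes": mem_bytes,
        "mem_is_half": mem_bytes is not None and mem_bytes * 2 == physical_bytes,
    }


if __name__ == "__main__":
    # Hypothetical command line for a 256MB board; on the target itself,
    # read the real one from /proc/cmdline instead.
    sample = "console=ttyO2,115200n8 root=/dev/mmcblk0p2 rootwait nohlt mem=128M"
    print(check_cmdline(sample, 256 << 20))
```

On the target, passing the contents of `/proc/cmdline` instead of the sample string shows what the bootloader actually handed to the kernel, which is worth checking when several bootloader versions are in play.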

Regards,

Tony 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Please help!  AM35xx mm/slab.c BUG
  2012-06-05  7:08 ` Tony Lindgren
@ 2012-06-05 16:29   ` CF Adad
  2012-06-06  6:14     ` CF Adad
  0 siblings, 1 reply; 20+ messages in thread
From: CF Adad @ 2012-06-05 16:29 UTC (permalink / raw)
  To: Tony Lindgren; +Cc: linux-omap

Hi Tony,

Thanks so much for the response!  All good suggestions.

#1.) Missing retention/off idle workarounds
I'm highly suspicious of this one.  I've seen a lot of patches addressing things in this category come out recently for the Sitara series, and we've tried to incorporate everything we've seen.  We also rebased our tree off the linux-omap master as recently as May 17th.  As I mentioned in the first post, I hope to do this again soon, perhaps even today, to pull in all the good work you folks have done bringing us up to the RCs of 3.5.


Since we discovered the "nohlt" option, we've added it to our default kernel command line and have been running with it.  For a while, I thought maybe that had fixed the glitch, but then yesterday came along...  The crash from the first message occurred with 'nohlt' enabled.


#2.) Broken Memory
We really hammered this one as well, as TechNexion delivered our boards with 256MB of NANYA NT5TU64M16GG–AC RAM.  Since we were unfamiliar with that part, we rolled up our sleeves and evaluated every timing and configuration parameter in x-loader using the EMIF4 settings calculator spreadsheet provided by TI.  We also have been running cycles of "memtester 200M" calls, and the board seems to hold up fine under that with both the default, very conservative timings and the more optimized ones we determined with the TI sheet.

I'll give your suggestion of limiting the memory a shot and see if that makes a difference.  Several of our older captures were run with SLAB_DEBUG set, but it seemed at the time that we weren't getting any more info out of that so we disabled it.  I'll re-enable.

#3.) Software bugs
We're certainly not opposed to the idea that we're doing something wrong.  :)  In fact, that would almost seem likely at this point.

A few other things that may be helpful:

* Could these issues be related to our GPMC?
We're using the SMSC LAN9221 on our board, not the slower LAN9220 that it seems all the AM35xx dev. kits are using.  Frankly, the fastest we could get with that chip was ~40Mbps with a ~1-2% packet loss.  :-(  So, we stepped up to the faster LAN9221 that's used by Gumstix and several others on the OMAP series.  It's running super-well right now (> 80Mbps with 0% loss) with the faster GPMC timings and configuration provided with the Gumstix source.  Is there perhaps a reason all the AM35xx boards were using the LAN9220 instead?  We assumed the AM35xx GPMC was essentially as capable as the OMAP's.  Was that a faulty assumption?

Speaking of GPMC, our NAND that Technexion is delivering requires a 4-bit ECC.  As support for that seems spotty at the moment in the various bootloader and kernel configurations, we finally punted and simply used Micron's on-die engine to do it.  It appears stable, and we've done various filesystem burn-in tests to stress it.  A little while back we also rigged a combination nandtest + iperf across the SMSC to really stress the GPMC.  This too ran fine for several iterations.

* DaVinci EMAC?
Perhaps it's just my latest thought-of-the-day, but since I saw so many of these things yesterday while focusing on Ethernet work, after seeing none for the past several days doing other work, I can't help but think it may be related to the networks somehow.  Some of our TAM3517's do not have the SMSC hooked up to them.  They are just using their EMAC adapters, but they have exhibited these SLAB crashes too.  So, maybe it's the EMAC?


We've noticed that when we run bandwidth tests between a pair of EMACs using iperf, we get a pretty reduced data rate, maybe 60Mbps.  There is also the occasional dropped packet.  When we connect an EMAC to another port, say a laptop or a Gumstix SMSC, we get blazing performance.  That seems very odd.  It's like the driver is more than capable of producing those high-class speeds, but when two of them get together they agree to dog it.  Could this maybe be related???


Thanks again for your time and help!





^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Please help!  AM35xx mm/slab.c BUG
  2012-06-05 16:29   ` CF Adad
@ 2012-06-06  6:14     ` CF Adad
  2012-06-06  6:36       ` Shilimkar, Santosh
  0 siblings, 1 reply; 20+ messages in thread
From: CF Adad @ 2012-06-06  6:14 UTC (permalink / raw)
  To: Tony Lindgren; +Cc: linux-omap

All,


We've learned a few more things:

1.) We have found a way to get it to happen pretty consistently.  We simply run iperf in a loop using the EMAC port to some other device.


2.) The crash ONLY happens on our custom board, not on the Twister dev kit.  This is true despite the fact that I ported our latest linux-omap 3.4-rc6 over there.  We're still running Technexion's default x-loader and u-boot to handle proper configs on that board. So, that's a substantial bit of code that is different between our boxes.  The kernel is altered only in that the few pinmux changes I left in Linux have been removed to avoid configuration differences between the two boards.


This suggests that either:
A) We have a hardware problem on our board.  Seems unlikely.  Can anyone think of anything hardware related that would manifest itself with these sorts of errors?


B) We have an issue in our bootloader code somewhere.  I hesitated to overwrite the bootloaders for this test on the Twister baseboard just because I did not want to have to mess with getting the pinmuxes and the like put back.

Presuming something in those bootloaders is our problem, I wonder what EMAC-related stuff there really is.  For a long time we ran with our bootloaders NOT initializing either of the Eths.  This was Technexion's default.  They left that work to Linux.  We've recently done work to enable them in u-boot, but we were crashing like this long before that.  Once in Linux, we're just using the standard drivers and calls from within the board file to SMSC911x and the Davinci EMAC drivers.  I am using the patches that allow the e-fused MAC to be pulled from the AM35xx for the EMAC, but I can't see how that would cause this.

Assuming the EMAC is perhaps an innocent bystander that happens just to cause this, the place I would have to suspect the most in our bootloaders would be the GPMC settings.  We've done a good bit of tweaking in there since we switched chips.  *Could a GPMC timing issue account for these types of errors???*  The reason I bring it up is that the GPMC has been one of those things that we've really struggled to understand.  What should the timings *really* be?  We've done the best we can to try to guess our way through it.  BUT, we could certainly be very wrong.  If a GPMC setting could cause these types of bugs, please let me know.  I'll be happy to post more info on how we're setting that up now.  In case not, I'll save the electrons and not spam it here.


Thanks again for all your help!


PS -- If it's useful, here is our latest crash, with SLAB debugging enabled:

[ 5278.124023] slab: Internal list corruption detected in cache 'skbuff_head_cache'(20), slabp cecbb040(4). Tainted(Not tainted). Hex:
[ 5278.136840] 00000000: 00 01 10 00 00 02 20 00 b0 00 00 00 b0 b0 cb ce  ...... .........
[ 5278.145263] 00000010: 04 00 00 00 11 00 00 00 00 00 6b 6b 0f 00 00 00  ..........kk....
[ 5278.153686] 00000020: 03 00 00 00 0c 00 00 00 09 00 00 00 fe ff ff ff  ................
[ 5278.162078] 00000030: fd ff ff ff fd ff ff ff fd ff ff ff 10 00 00 00  ................
[ 5278.170501] 00000040: 02 00 00 00 13 00 00 00 00 00 00 00 ff ff ff ff  ................
[ 5278.178924] 00000050: 00 00 00 00 0b 00 00 00 0d 00 00 00 0a 00 00 00  ................
[ 5278.187316] 00000060: 12 00 00 00 0e 00 00 00 01 00 00 00              ............
[ 5278.195404] ------------[ cut here ]------------
[ 5278.200256] kernel BUG at mm/slab.c:3114!
[ 5278.204467] Internal error: Oops - BUG: 0 [#1] ARM
[ 5278.209503] Modules linked in:
[ 5278.212707] CPU: 0    Not tainted  (3.4.0-rc6 #2)
[ 5278.217681] PC is at check_slabp+0xe4/0xf4
[ 5278.222015] LR is at console_unlock+0x174/0x214
[ 5278.226776] pc : [<c00c3b08>]    lr : [<c002f8e0>]    psr: 80000093
[ 5278.226806] sp : cf83fc40  ip : 00000070  fp : cf83fc74
[ 5278.238861] r10: cecbb3b0  r9 : c04f91c0  r8 : cf812800
[ 5278.244354] r7 : 00000004  r6 : cecbb040  r5 : 00000014  r4 : c0486154
[ 5278.251220] r3 : c0508718  r2 : 20000093  r1 : 00000001  r0 : 0000005d
[ 5278.258117] Flags: Nzcv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment user
[ 5278.265716] Control: 10c5387d  Table: 8eda0019  DAC: 00000015
[ 5278.271759] Process iperf (pid: 1434, stack limit = 0xcf83e2f0)
[ 5278.277984] Stack: (0xcf83fc40 to 0xcf840000)
[ 5278.282562] fc40: 00000001 cecbb040 0000006c 00000001 cf83fca4 cecbb040 cf812800 cf813a00
[ 5278.291168] fc60: 00000005 cf816464 cf83fcbc cf83fc78 c00c4dbc c00c3a30 cf83fccc 00000000
[ 5278.299804] fc80: 00000010 00200200 00100100 00000000 cf83fce4 cf813a00 00000010 cf816440
[ 5278.308410] fca0: 00000000 cf812800 00000b90 cef89670 cf83fce4 cf83fcc0 c037ea2c c00c4cc0
[ 5278.317016] fcc0: cf81287c cf812800 cf816440 cef89678 c02ec07c 60000013 cf83fd0c cf83fce8
[ 5278.325653] fce0: c00c4b18 c037e998 cef89678 000005a8 000005a8 cedf981c cedf9500 00000000
[ 5278.334259] fd00: cf83fd24 cf83fd10 c02ec07c c00c4a3c cedf953c cef89678 cf83fd84 cf83fd28
[ 5278.342864] fd20: c0325ec0 c02ec034 cf83fd6c c00c3fd0 c03172d4 c03177a0 cfa52800 00000001
[ 5278.351470] fd40: cf83fecc 00000000 00000000 00001470 c05335c0 7fffffff cf83fd94 c0530610
[ 5278.360107] fd60: cf83fecc 00000000 cf621cf0 00000000 cf83fecc 00002000 cf83fdbc cf83fd88
[ 5278.368713] fd80: c0343aa4 c03258a8 00000000 00000000 cf83fd9c cfa74f40 d08b4d80 00000000
[ 5278.377319] fda0: 000006fe 00000000 00000000 00000000 cf83feb4 cf83fdc0 c02e3234 c0343a64
[ 5278.385955] fdc0: 00000000 cfa4ff40 cf83fe1c cf83fdd8 00000000 00002000 cf621cf0 cecbbbf8
[ 5278.394561] fde0: 00000000 cf83fecc cecb9d78 60000113 cfa52c80 cfa52800 cecb9d78 837fee5f
[ 5278.403167] fe00: 00000000 000005ea c05335c0 cecbbbf8 00000000 00000001 ffffffff 00000000
[ 5278.411773] fe20: 00000000 00000000 00000000 00000000 cedc60c0 cfa74f40 00000000 00000000
[ 5278.420410] fe40: c0287e54 c0286b0c cf83fdc8 00000000 d26d4d80 d08d0660 cfa52800 cf83e000
[ 5278.429016] fe60: cf83fe8c cf83fe70 c0287f2c c0288a98 00000001 c02e50d0 cf83fe94 cf83fe88
[ 5278.437622] fe80: c0026c64 c02e33dc cf83febc 00002000 cf621cf0 00000000 cf83fee8 00000000
[ 5278.446258] fea0: cf83e000 00082ee0 cf83ff8c cf83feb8 c02e508c c02e3188 00000001 fffffff7
[ 5278.454864] fec0: 00000001 00083a70 00001470 cf83fee8 00000080 cf83fec4 00000001 00000000
[ 5278.463470] fee0: 00000000 00000001 00000003 c0034d8c 00000100 00000000 00000003 00000010
[ 5278.472076] ff00: cf83ff54 cf83ff10 c0034d8c c0034428 00000044 03419fc0 00000000 0000000a
[ 5278.480712] ff20: c05468c0 00000100 c007632c cf83e000 00000044 c0035218 c050b338 cf83e000
[ 5278.489318] ff40: 00000044 00000000 cf83ff6c cf83ff58 c0035218 c0079544 c007633c c0522b6c
[ 5278.497924] ff60: cf83ff8c cf83ff70 00082ec8 00084ee8 00082ee0 00000123 c000e9c4 00000000
[ 5278.506561] ff80: cf83ffa4 cf83ff90 c02e5104 c02e5000 00000000 00000000 00000000 cf83ffa8
[ 5278.515167] ffa0: c000e780 c02e50e8 00082ec8 00084ee8 00000004 00082ee0 00002000 00000000
[ 5278.523773] ffc0: 00082ec8 00084ee8 00082ee0 00000123 0346bfc0 00000000 00002000 b5ce4f9c
[ 5278.532379] ffe0: 00000000 b5ce4d98 b6e90788 b6e91394 80000010 00000004 6b6b6b6b a56b6b6b
[ 5278.540985] Backtrace: 
[ 5278.543579] [<c00c3a24>] (check_slabp+0x0/0xf4) from [<c00c4dbc>] (free_block+0x108/0x20c)
[ 5278.552276]  r8:cf816464 r7:00000005 r6:cf813a00 r5:cf812800 r4:cecbb040
[ 5278.559387] [<c00c4cb4>] (free_block+0x0/0x20c) from [<c037ea2c>] (cache_flusharray+0xa0/0xfc)
[ 5278.568420] [<c037e98c>] (cache_flusharray+0x0/0xfc) from [<c00c4b18>] (kmem_cache_free+0xe8/0xf0)
[ 5278.577850]  r8:60000013 r7:c02ec07c r6:cef89678 r5:cf816440 r4:cf812800
[ 5278.584747] r3:cf81287c
[ 5278.587524] [<c00c4a30>] (kmem_cache_free+0x0/0xf0) from [<c02ec07c>] (__kfree_skb+0x54/0xcc)
[ 5278.596496] [<c02ec028>] (__kfree_skb+0x0/0xcc) from [<c0325ec0>] (tcp_recvmsg+0x624/0x864)
[ 5278.605285]  r4:cef89678 r3:cedf953c
[ 5278.609069] [<c032589c>] (tcp_recvmsg+0x0/0x864) from [<c0343aa4>] (inet_recvmsg+0x4c/0x60)
[ 5278.617858] [<c0343a58>] (inet_recvmsg+0x0/0x60) from [<c02e3234>] (sock_recvmsg+0xb8/0xd8)
[ 5278.626617]  r6:00000000 r5:00000000 r4:00000000
[ 5278.631500] [<c02e317c>] (sock_recvmsg+0x0/0xd8) from [<c02e508c>] (sys_recvfrom+0x98/0xe8)
[ 5278.640289] [<c02e4ff4>] (sys_recvfrom+0x0/0xe8) from [<c02e5104>] (sys_recv+0x28/0x30)
[ 5278.648712] [<c02e50dc>] (sys_recv+0x0/0x30) from [<c000e780>] (ret_fast_syscall+0x0/0x30)
[ 5278.657409] Code: e58d3008 e3a03010 e59f100c eb04f0a3 (e7f001f2) 
[ 5278.668273] ---[ end trace 018554de1af4a1fa ]---
[ 5300.147521] slab: Internal list corruption detected in cache 'skbuff_head_cache'(20), slabp cee4a000(12). Tainted(Tainted: G      :
[ 5300.161437] 00000000: 00 50 d8 ce 00 3a 81 cf 70 00 00 00 70 a0 e4 ce  .P...:..p...p...
[ 5300.169860] 00000010: 0c 00 00 00 07 00 00 00 00 00 6b 6b fd ff ff ff  ..........kk....
[ 5300.178283] 00000020: 05 00 00 00 fd ff ff ff fd ff ff ff fd ff ff ff  ................
[ 5300.186676] 00000030: 06 00 00 00 0a 00 00 00 fd ff ff ff 01 00 00 00  ................
[ 5300.195098] 00000040: fd ff ff ff ff ff ff ff 08 00 00 00 fd ff ff ff  ................
[ 5300.203521] 00000050: fd ff ff ff fd ff ff ff fd ff ff ff fd ff ff ff  ................
[ 5300.211914] 00000060: fd ff ff ff fd ff ff ff fd ff ff ff              ............
[ 5300.220001] ------------[ cut here ]------------
[ 5300.224853] kernel BUG at mm/slab.c:3114!
[ 5300.229064] Internal error: Oops - BUG: 0 [#2] ARM
[ 5300.234100] Modules linked in:
[ 5300.237304] CPU: 0    Tainted: G      D       (3.4.0-rc6 #2)
[ 5300.243286] PC is at check_slabp+0xe4/0xf4
[ 5300.247589] LR is at console_unlock+0x174/0x214
[ 5300.252349] pc : [<c00c3b08>]    lr : [<c002f8e0>]    psr: 80000193
[ 5300.252380] sp : c04efc98  ip : 00000070  fp : c04efccc
[ 5300.264434] r10: cf812800  r9 : fffffffe  r8 : cf812800
[ 5300.269927] r7 : 0000000c  r6 : cee4a000  r5 : 00000014  r4 : c0486154
[ 5300.276763] r3 : c0508718  r2 : 20000193  r1 : 00000001  r0 : 0000005d
[ 5300.283630] Flags: Nzcv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment kernel
[ 5300.291412] Control: 10c5387d  Table: 8eda0019  DAC: 00000015
[ 5300.297454] Process swapper (pid: 0, stack limit = 0xc04ee2f0)
[ 5300.303588] Stack: (0xc04efc98 to 0xc04f0000)
[ 5300.308166] fc80:                                                       00000001 cee4a000
[ 5300.316741] fca0: 0000006c 00000001 00000009 cee4a000 00000004 0000000c cecb905c cf816440
[ 5300.325347] fcc0: c04efd24 c04efcd0 c037e398 c00c3a30 c04efd74 fb0000e0 00000000 cf813a00
[ 5300.333953] fce0: 00000020 00000020 00000000 00000020 00200200 00100100 c0317ac8 cf812800
[ 5300.342559] fd00: 60000113 00000020 c02eb484 00000020 c05335c0 00000000 c04efd54 c04efd28
[ 5300.351165] fd20: c00c4784 c037e244 cfa52800 00000008 cfa52800 00000020 cf812800 00000634
[ 5300.359741] fd40: c02eba04 00000000 c04efd7c c04efd58 c02eb484 c00c463c cfa52800 cecb3978
[ 5300.368347] fd60: 8385d0e8 00000000 00000073 cecb3978 c04efd94 c04efd80 c02eba04 c02eb454
[ 5300.376953] fd80: cfa52c80 cfa52800 c04efdac c04efd98 c0285c74 c02eb9e4 019caec5 cfa52800
[ 5300.385559] fda0: c04efdd4 c04efdb0 c0286b74 c0285c58 c0017b30 c0548030 cfa4ff40 cfa4ff40
[ 5300.394165] fdc0: 60000113 cfa74f40 c04efdfc c04efdd8 c0287e54 c0286b0c cfa4ff40 00000000
[ 5300.402740] fde0: d26d4260 d08d0660 cfa52800 c04ee000 c04efe1c c04efe00 c0287f2c c0287db0
[ 5300.411346] fe00: 00000000 cfa74f40 00000040 00000040 c04efe3c c04efe20 c0288a98 c0287e6c
[ 5300.419952] fe20: 00000001 cfa52c8c 00000001 00000001 c04efe64 c04efe40 c0287048 c0288a58
[ 5300.428558] fe40: c0286fac cfa52c8c 00000001 00000040 0000012c c0509978 c04efe9c c04efe68
[ 5300.437164] fe60: c02f5694 c0286fb8 00000001 0009c410 c04efebc 00000001 00000003 0000000c
[ 5300.445739] fe80: c05468d0 c05468cc 411fc087 c04ee000 c04efee4 c04efea0 c0034d1c c02f55f0
[ 5300.454345] fea0: 00000044 c04fa1c0 411fc087 0000000a c05468c0 00000100 c007632c c04ee000
[ 5300.462951] fec0: 00000044 00000000 00000044 c04fa1c0 411fc087 00000000 c04efefc c04efee8
[ 5300.471557] fee0: c0035220 c0034c78 c007633c c0522b6c c04eff1c c04eff00 c000f0d0 c00351a0
[ 5300.480163] ff00: 00000044 fa200000 c04eff40 c0535aa0 c04eff3c c04eff20 c00085cc c000f098
[ 5300.488769] ff20: c000f448 20000013 ffffffff c04eff74 c04effac c04eff40 c000e3c0 c0008564
[ 5300.497344] ff40: 00000000 00000000 00000000 00000001 c04ee000 c04ee000 c05351c8 c04ee000
[ 5300.505950] ff60: c04fa1c0 411fc087 00000000 c04effac c04eff30 c04eff88 c00794c8 c000f448
[ 5300.514556] ff80: 20000013 ffffffff 00000000 c04f6f68 c0535140 00000000 c078b140 80004059
[ 5300.523162] ffa0: c04effbc c04effb0 c03753b4 c000f40c c04efff4 c04effc0 c04b179c c0375358
[ 5300.531768] ffc0: 00000000 00000000 c04b12e0 00000000 00000000 c04d5194 10c5387d c04f608c
[ 5300.540344] ffe0: c04d5190 c04fa1b4 00000000 c04efff8 80008040 c04b155c 00000000 00000000
[ 5300.548950] Backtrace: 
[ 5300.551544] [<c00c3a24>] (check_slabp+0x0/0xf4) from [<c037e398>] (cache_alloc_refill+0x160/0x754)
[ 5300.560974]  r8:cf816440 r7:cecb905c r6:0000000c r5:00000004 r4:cee4a000
[ 5300.568054] [<c037e238>] (cache_alloc_refill+0x0/0x754) from [<c00c4784>] (kmem_cache_alloc+0x154/0x164)
[ 5300.578033] [<c00c4630>] (kmem_cache_alloc+0x0/0x164) from [<c02eb484>] (__alloc_skb+0x3c/0xfc)
[ 5300.587188] [<c02eb448>] (__alloc_skb+0x0/0xfc) from [<c02eba04>] (__netdev_alloc_skb+0x2c/0x54)
[ 5300.596435] [<c02eb9d8>] (__netdev_alloc_skb+0x0/0x54) from [<c0285c74>] (emac_rx_alloc+0x28/0x64)
[ 5300.605865]  r4:cfa52800 r3:cfa52c80
[ 5300.609619] [<c0285c4c>] (emac_rx_alloc+0x0/0x64) from [<c0286b74>] (emac_rx_handler+0x74/0x11c)
[ 5300.618865]  r4:cfa52800 r3:019caec5
[ 5300.622619] [<c0286b00>] (emac_rx_handler+0x0/0x11c) from [<c0287e54>] (__cpdma_chan_free+0xb0/0xbc)
[ 5300.632232]  r6:cfa74f40 r5:60000113 r4:cfa4ff40
[ 5300.637115] [<c0287da4>] (__cpdma_chan_free+0x0/0xbc) from [<c0287f2c>] (__cpdma_chan_process+0xcc/0x104)
[ 5300.647186] [<c0287e60>] (__cpdma_chan_process+0x0/0x104) from [<c0288a98>] (cpdma_chan_process+0x4c/0x64)
[ 5300.657318]  r7:00000040 r6:00000040 r5:cfa74f40 r4:00000000
[ 5300.663299] [<c0288a4c>] (cpdma_chan_process+0x0/0x64) from [<c0287048>] (emac_poll+0x9c/0x208)
[ 5300.672424]  r6:00000001 r5:00000001 r4:cfa52c8c r3:00000001
[ 5300.678405] [<c0286fac>] (emac_poll+0x0/0x208) from [<c02f5694>] (net_rx_action+0xb0/0x1a8)
[ 5300.687194]  r8:c0509978 r7:0000012c r6:00000040 r5:00000001 r4:cfa52c8c
[ 5300.694061] r3:c0286fac
[ 5300.696838] [<c02f55e4>] (net_rx_action+0x0/0x1a8) from [<c0034d1c>] (__do_softirq+0xb0/0x1d8)
[ 5300.705902] [<c0034c6c>] (__do_softirq+0x0/0x1d8) from [<c0035220>] (irq_exit+0x8c/0x94)
[ 5300.714416] [<c0035194>] (irq_exit+0x0/0x94) from [<c000f0d0>] (handle_IRQ+0x44/0x94)
[ 5300.722656]  r4:c0522b6c r3:c007633c
[ 5300.726409] [<c000f08c>] (handle_IRQ+0x0/0x94) from [<c00085cc>] (omap3_intc_handle_irq+0x74/0x84)
[ 5300.735839]  r6:c0535aa0 r5:c04eff40 r4:fa200000 r3:00000044
[ 5300.741821] [<c0008558>] (omap3_intc_handle_irq+0x0/0x84) from [<c000e3c0>] (__irq_svc+0x40/0x60)
[ 5300.751129] Exception stack(0xc04eff40 to 0xc04eff88)
[ 5300.756439] ff40: 00000000 00000000 00000000 00000001 c04ee000 c04ee000 c05351c8 c04ee000
[ 5300.765045] ff60: c04fa1c0 411fc087 00000000 c04effac c04eff30 c04eff88 c00794c8 c000f448
[ 5300.773651] ff80: 20000013 ffffffff
[ 5300.777313]  r7:c04eff74 r6:ffffffff r5:20000013 r4:c000f448
[ 5300.783294] [<c000f400>] (cpu_idle+0x0/0xb8) from [<c03753b4>] (rest_init+0x68/0x80)
[ 5300.791412]  r8:80004059 r7:c078b140 r6:00000000 r5:c0535140 r4:c04f6f68
[ 5300.798309] r3:00000000
[ 5300.801055] [<c037534c>] (rest_init+0x0/0x80) from [<c04b179c>] (start_kernel+0x24c/0x290)
[ 5300.809753] [<c04b1550>] (start_kernel+0x0/0x290) from [<80008040>] (0x80008040)
[ 5300.817535] Code: e58d3008 e3a03010 e59f100c eb04f0a3 (e7f001f2) 
[ 5300.824005] ---[ end trace 018554de1af4a1fb ]---
[ 5300.828887] Kernel panic - not syncing: Fatal exception in interrupt
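
The two hex dumps in the log above can be decoded offline to see what the slab header and free chain looked like when the check fired. The sketch below is an illustration under stated assumptions, not the kernel's own code: it assumes the 32-bit little-endian `struct slab` layout from mm/slab.c of this era (list next/prev, colouroff, s_mem, inuse, free, a 16-bit nodeid, then the bufctl array starting at byte 28) and the debug bufctl markers `BUFCTL_END`/`BUFCTL_FREE`/`BUFCTL_ACTIVE`. The names `decode_slab` and `walk_free_chain` are hypothetical. The kernel's check bounds the walk by counting entries; this sketch tracks visited indices instead so it can report where the chain goes wrong.

```python
import struct

# Hex bytes copied from the two 'skbuff_head_cache' dumps above.
DUMP_1 = """
00 01 10 00 00 02 20 00 b0 00 00 00 b0 b0 cb ce
04 00 00 00 11 00 00 00 00 00 6b 6b 0f 00 00 00
03 00 00 00 0c 00 00 00 09 00 00 00 fe ff ff ff
fd ff ff ff fd ff ff ff fd ff ff ff 10 00 00 00
02 00 00 00 13 00 00 00 00 00 00 00 ff ff ff ff
00 00 00 00 0b 00 00 00 0d 00 00 00 0a 00 00 00
12 00 00 00 0e 00 00 00 01 00 00 00
"""
DUMP_2 = """
00 50 d8 ce 00 3a 81 cf 70 00 00 00 70 a0 e4 ce
0c 00 00 00 07 00 00 00 00 00 6b 6b fd ff ff ff
05 00 00 00 fd ff ff ff fd ff ff ff fd ff ff ff
06 00 00 00 0a 00 00 00 fd ff ff ff 01 00 00 00
fd ff ff ff ff ff ff ff 08 00 00 00 fd ff ff ff
fd ff ff ff fd ff ff ff fd ff ff ff fd ff ff ff
fd ff ff ff fd ff ff ff fd ff ff ff
"""

# Assumed marker values (debug builds): end of chain, being freed, allocated.
BUFCTL_END = 0xFFFFFFFF
BUFCTL_FREE = 0xFFFFFFFE
BUFCTL_ACTIVE = 0xFFFFFFFD
HEADER_SIZE = 28  # assumed: 26 bytes of fields padded to a 4-byte boundary


def decode_slab(dump, num):
    """Decode the assumed struct slab header plus `num` bufctl entries."""
    raw = bytes.fromhex("".join(dump.split()))
    nxt, prev, colouroff, s_mem, inuse, free, nodeid = struct.unpack_from(
        "<IIIIIIH", raw, 0)
    bufctl = list(struct.unpack_from("<%dI" % num, raw, HEADER_SIZE))
    return {"next": nxt, "prev": prev, "colouroff": colouroff, "s_mem": s_mem,
            "inuse": inuse, "free": free, "nodeid": nodeid, "bufctl": bufctl}


def walk_free_chain(slab):
    """Follow the free chain; return (visited indices, verdict string)."""
    bufctl = slab["bufctl"]
    seen = []
    i = slab["free"]
    while i != BUFCTL_END:
        if i >= len(bufctl):
            return seen, "index out of range"
        if i in seen:
            return seen, "cycle"
        seen.append(i)
        i = bufctl[i]
    return seen, "ok"


if __name__ == "__main__":
    # 20 is the objects-per-slab count printed in the log: 'skbuff_head_cache'(20)
    for name, dump in (("first", DUMP_1), ("second", DUMP_2)):
        slab = decode_slab(dump, 20)
        print(name, "inuse=%d free=%d" % (slab["inuse"], slab["free"]),
              walk_free_chain(slab))
```

Under these assumptions, the first dump's chain starts at index 17 and eventually loops back to index 0, while the second dump's head index 7 points at an entry marked active. Neither result says what corrupted the slab, only that the bookkeeping itself is inconsistent in both cases.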



----- Original Message -----
From: CF Adad <cfadad@rocketmail.com>
To: Tony Lindgren <tony@atomide.com>
Cc: "linux-omap@vger.kernel.org" <linux-omap@vger.kernel.org>
Sent: Tuesday, June 5, 2012 12:29 PM
Subject: Re: Please help!  AM35xx mm/slab.c BUG

Hi Tony,

Thanks so much for the response!  All good suggestions.

#1.) Missing retention/off idle workarounds
I'm highly suspect of this one.  I've seen a lot of patches addressing things in this category come out recently for the Sitara series, and we've tried to incorporate everything we've seen.  We also rebased our tree off the linux-omap masteras recently as May 17th.  As I mentioned in the first post, I hope to do this again soon, perhaps today even, to pull in all the good work you folks have done bringing us up to the RCs of 3.5.


Since we discovered the "nohlt" option, we've added it to our default kernel command line and have been using with it.  For a while, I thought maybe that had fixed the glitch, but then yesterday came along...  That crash from the first message occured with 'nohlt' enabled.


#2.) Broken Memory
We really hammered this one as well, as TechNexion delivered our boards with 256MB of NANYA NT5TU64M16GG–AC RAM.  Since we were unfamiliar with that part, we rolled up our sleeves and evaluated every timing and configuration paramter in x-loader using the EMIF4 settings calculator spreadsheet provided by TI.  We also have been running cycles of "memtester 200M" calls, and the board seems to hold up fine under that with both the default, very conservative timings and the more optimized ones we determinded with the TI sheet.

I'll give your suggestion of limiting the memory a shot and see if that makes a difference.  Several of our older captures were run with SLAB_DEBUG set, but it seemed at the time that we weren't getting any more info out of that so we disabled it.  I'll re-enable.

#3.) Software bugs
We're certainly not opposed to the idea that we're doing something wrong.  :)  In fact, that would almost seem likely at this point.

A few other things that may be helpful:

* Could these issues be related to our GPMC?
We're using the SMSC LAN9221 on our board, not the slower LAN9220 that it seems all the AM35xx dev. kits are using.  Frankly, the fastest we could get with that chip was ~40Mbps with a ~1-2% packet loss.  :-(  So, we stepped up to the faster LAN9221 that's used by Gumstix and several others on the OMAP series.  It's running super-well right now (> 80Mbps with 0% loss) with the faster GPMC timings and configuration provided with the Gumstix source.  Is there perhaps a reason all the AM35xx boards were using the LAN9220 instead?  We assumed the AM35xx GPMC was essentially as capable as the OMAP's.  Was that a faulty assumption?

Speaking of GPMC, our NAND that Technexion is delivering requires a 4-bit ECC.  As support for that seems spotty at the moment in the various bootloader and kernel configurations, we finally punted and simply used Micron's on-die engine to do it.  It appears stable, and we've done various filesystem burn-in tests to stress it.  A little while back we also rigged a combination nandtest + iperf across the SMSC to really stress the GPMC.  This too ran fine for several iterations.

* DaVinci EMAC?:
Perhaps it's just my latest thought-of-the-day, but since I saw so many of these things yesterday while focusing on Ethernet work, after seeing none for the past several days doing other work, I can't help but think it may be related to the networks somehow.  Some of our TAM3517's do not have the SMSC hooked up to them.  They are just using their EMAC adapters, but they have exhibited these SLAB crashes too.  So, maybe it's the EMAC?


We've noticed that when we run bandwidth tests between a pair of EMACs using iperf, we get a pretty reduced data rate, maybe 60Mbps.  There is also the occasional dropped packet.  When we connect an EMAC to another port, say a laptop or a Gumstix SMSC, we get blazing performance.  That seems very odd.  It's like the driver is more than capable of producing those high-class speeds, but when two of them get together they agree to dog it.  Could this maybe be related???


Thanks again for your time and help!




----- Original Message -----
From: Tony Lindgren <tony@atomide.com>
To: CF Adad <cfadad@rocketmail.com>
Cc: "linux-omap@vger.kernel.org" <linux-omap@vger.kernel.org>
Sent: Tuesday, June 5, 2012 3:08 AM
Subject: Re: Please help!  AM35xx mm/slab.c BUG

* CF Adad <cfadad@rocketmail.com> [120604 23:47]:
> All,
> 
> I'm **really** hoping someone out there can help us with this.
> 
> My team has been working with the AM3517 for several months now, and we seem to be plagued every so often by what we have termed the "slab bug".  In short, it looks something like the pasted bootlog below.  This has been an *incredibly* hard bug to figure out.  We have a couple of different AM3517-based platforms at our disposal, but the one we see the issue on almost exclusively is a custom, prototype baseboard designed around the TechNexion TAM3157.  Over the last several months, we have tried several versions of the Linux off the linux-omap tree, with loads of different configurations, and even different bootloader versions and combinations.  We've spent most of our time with a linux-omap snapshot that was a 3.2-rc6, and more recently a 3.4-rc6 from late a week or two back.  (Tomorrow I anticipate pulling the latest 3.5 now that I see it's out.)  In all cases, since we switched to 3.0+, we've seen these errors.
> 
> They are *very* inconsistent in when they occur, but they happen often enough to be very frustrating.  Consequently, our team has had an incredibly difficult time tracking what's causing them.  They seem to occur at random, perhaps on average once every handful of days.  We've messed with everything we can think of from tweaking kernel options (like enabling/disabling preemption), to disabling various drivers and userspace components, to reviewing every single line in any of our board files.  We have tried different versions and combinations of the OS and both bootloaders (x-loader & u-boot), and even went so far as to do a full analysis of the RAM timings in the EMIF4.  Unfortunately, nothing so far has worked.  The error occurs when operating off both the SD/MMC and the NAND devices, with or without the Ethernets (LAN9221 & EMAC) up and/or running, with or without PREEMPT, under heavy load and sometimes just idling, ...  There is simply nothing
>  consistent about it.  After probably 2 weeks without seeing one, I saw 3 today.
> 
> Though the error's occurrence is inconsistent, the error itself is consistent.  It always throws an internal OOPs at the following section of code in mm/slab.c:
> ---
> /*
> * The slab was either on partial or free list so
> * there must be at least one object available for
> * allocation.
> */
> BUG_ON(slabp->inuse >= cachep->num);
> ---
> (It appears this was patched in eons ago: https://lkml.org/lkml/2007/2/19/20.  So it's nothing new.)

I can think of at least three issues causing errors like this:

1. Missing retention/off idle workarounds

   You can test this one by booting with nohlt cmdline option and
   seeing if that helps.

2. Broken memory

   I've seen at least one case of this where things would work
   fine if only half of the memory was in use and devices would
   oops at a random point within a week. To test for this you can
   pass cmdline options to artificially partition the memory and
   leave out some chunks to see if that helps. Or boot with
   mem=xxxM set to half of the physical memory. And run your tests
   with SLAB_DEBUG set.

3. Software bugs

   My experience is that things are behaving very reliably regarding
   cache and highmem, so I would check #1 and #2 first.

Regards,

Tony 
--
To unsubscribe from this list: send the line "unsubscribe linux-omap" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Please help! AM35xx mm/slab.c BUG
  2012-06-06  6:14     ` CF Adad
@ 2012-06-06  6:36       ` Shilimkar, Santosh
  2012-06-06  7:08         ` CF Adad
  2012-06-06  7:10         ` Tony Lindgren
  0 siblings, 2 replies; 20+ messages in thread
From: Shilimkar, Santosh @ 2012-06-06  6:36 UTC (permalink / raw)
  To: CF Adad; +Cc: Tony Lindgren, linux-omap

On Wed, Jun 6, 2012 at 11:44 AM, CF Adad <cfadad@rocketmail.com> wrote:
> All,
>
>
> We've learned a few more things:
>
> 1.) We have found a way to get it to happen pretty consistently.  We simply run iperf in a loop using the EMAC port to some other device.
>
>
> 2.) The crash ONLY happens on our custom board, not on the Twister dev kit.  This is true despite the fact that I ported our latest linux-omap 3.4-rc6 over there.  We're still running Technexion's default x-loader and u-boot to handle proper configs on that board. So, that's a substantial bit of code that is different between our boxes.  The kernel is altered only in that the few pinmux changes I left in Linux have been removed to avoid configuration differences between the two boards.
>
>
> This suggests that either:
> A) We have a hardware problem on our board.  Seems unlikely.  Can anyone think of anything hardware related that would manifest itself with these sorts of errors?
>
>
> B) We have an issue in our bootloader code somewhere.  I hesitated to overwrite the bootloaders for this test on the Twister baseboard just because I did not want to have to mess with getting the pinmux's and the like put back and such.
>
> Presuming something in those bootloaders is our problem, I wonder what EMAC-related stuff there really is.  For a long time we ran with our bootloaders NOT initializing either of the Eths.  This was Technexion's default.  They left that work to Linux.  We've recently done work to enable them in u-boot, but we were crashing like this long before that.  Once in Linux, we're just using the standard drivers and calls from within the board file to SMSC911x and the Davinci EMAC drivers.  I am using the patches that allow the e-fused MAC to be pulled from the AM35xx for the EMAC, but I can't see how that would cause this.
>
> Assuming the EMAC is perhaps an innocent bystander that happens just to cause this, the place I would have to suspect the most in our bootloaders would be the GPMC settings.  We've done a good bit of tweaking in there since we switched chips.  *Could a GPMC timing issue account for these types of errors???*  The reason I bring it up is that the GPMC has been one of those things that we've really struggled to understand.  What should the timings *really* be?  We've done the best we can to try to guess our way through it.  BUT, we could certainly be very wrong.  If a GPMC setting could cause these types of bugs, please let me know.  I'll be happy to post more info on how we're setting that up now.  In case not, I'll save the electrons and not spam it here.
>
>
I don't know the AMXX architecture that well, but looking at the
crash log, I am not sure the GPMC should play a role here.
I think it is most likely memory corruption, which can be caused by
many things, as Tony outlined.

To ensure that your memory is in a good state, can you run memtester
for a long duration and check that you
are not getting any memory failures? Try to give the maximum memory
size as an input to memtester.

You can download one from [1]
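
For anyone unfamiliar with what such a test does, here is a minimal sketch of a write-then-verify pattern pass in C. It is not memtester's code, and the function names are made up for illustration. A real run would lock a large buffer and cycle many patterns for days; this only shows the basic idea.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Fill buf with an index-derived pattern XORed with seed, then read it
 * back. Returns the number of mismatching words (0 means a clean pass).
 */
size_t pattern_pass(volatile uint32_t *buf, size_t words, uint32_t seed)
{
    size_t i, errors = 0;

    for (i = 0; i < words; i++)
        buf[i] = (uint32_t)i ^ seed;

    for (i = 0; i < words; i++)
        if (buf[i] != ((uint32_t)i ^ seed))
            errors++;

    return errors;
}

/* One pass with the seed and one with its complement, so each bit is
 * exercised as both 0 and 1. */
size_t pattern_test(volatile uint32_t *buf, size_t words, uint32_t seed)
{
    return pattern_pass(buf, words, seed) + pattern_pass(buf, words, ~seed);
}
```

A clean result from a test like this only says the tested range held data under that access pattern; it does not rule out corruption caused by DMA or by another bus master.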

Regards
Santosh
[1] http://pyropus.ca/software/memtester/

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Please help! AM35xx mm/slab.c BUG
  2012-06-06  6:36       ` Shilimkar, Santosh
@ 2012-06-06  7:08         ` CF Adad
  2012-06-06  7:10         ` Tony Lindgren
  1 sibling, 0 replies; 20+ messages in thread
From: CF Adad @ 2012-06-06  7:08 UTC (permalink / raw)
  To: Shilimkar, Santosh; +Cc: Tony Lindgren, linux-omap

Hi Santosh,

Thanks for the comments!  If you do not think the GPMC can be a factor here, that's great news.  I'd certainly love to take something off the list.


Unfortunately, we've run "memtester 200M" (we only have 256MB) in constant cycles for weekends at a time, and have not had it crash.  We did this back when we thought perhaps our RAM timings were wrong since the NANYA chip was not one mentioned in the source anywhere.  If we were having memory errors, I would expect that to have caught them, correct?  Should we run that with a larger number than 200?  We didn't want to dig into our operating system's space too much.


Thanks again!




^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Please help! AM35xx mm/slab.c BUG
  2012-06-06  6:36       ` Shilimkar, Santosh
  2012-06-06  7:08         ` CF Adad
@ 2012-06-06  7:10         ` Tony Lindgren
  2012-06-06  7:51           ` CF Adad
  1 sibling, 1 reply; 20+ messages in thread
From: Tony Lindgren @ 2012-06-06  7:10 UTC (permalink / raw)
  To: Shilimkar, Santosh; +Cc: CF Adad, linux-omap

* Shilimkar, Santosh <santosh.shilimkar@ti.com> [120605 23:41]:
> 
> I don't know the AMXX architecture that well, but looking at the
> crash log, I am not sure the GPMC should play a role here.
> I think it is most likely memory corruption, which can be caused by
> many things, as Tony outlined.

Bad GPMC timings can cause corruption on the smsc fifo, which can
cause random oopses especially with nfsroot.

I was seeing that on my zoom3 with nfsroot with bad muxing for GPMC
until we applied bce492c0 (ARM: OMAP2+: UART: Fix incorrect population of
default uart pads). That's easy to test by leaving out GPMC.
 
> To ensure that your memory is in a good state, can you run memtester
> for a long duration and check that you
> are not getting any memory failures? Try to give the maximum memory
> size as an input to memtester.

Yes and that should be left running for a few weeks on a group of
devices to verify things work once you think you have all the issues
fixed.

Regards,

Tony

 
> You can download one from [1]
> 
> Regards
> Santosh
> [1] http://pyropus.ca/software/memtester/

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Please help! AM35xx mm/slab.c BUG
  2012-06-06  7:10         ` Tony Lindgren
@ 2012-06-06  7:51           ` CF Adad
  2012-06-06  8:41             ` Tony Lindgren
  2012-06-07  9:32             ` Mohammed, Afzal
  0 siblings, 2 replies; 20+ messages in thread
From: CF Adad @ 2012-06-06  7:51 UTC (permalink / raw)
  To: Tony Lindgren, Shilimkar, Santosh; +Cc: linux-omap

Hi Tony,

Thanks again.  I'm really starting to think the GPMC almost has to be contributing.  In our setup, we have 2 separate versions of a baseboard for the TAM.  One version has the LAN9221 as a secondary Ethernet, while the other does not.  Otherwise, they're basically the same.  I can run the same code on both.  It seems from our testing that the vast majority of the crashes occur on the module with the LAN9221. Even with the SMSC just idle, it seems the one with that extra port is most likely to fail.


As I mentioned previously, I'm far from a GPMC expert.  I posted a thread to the TI E2E forums asking them for comments on my LAN9221 GPMC settings (http://e2e.ti.com/support/dsp/sitara_arm174_microprocessors/f/416/p/172157/690750.aspx#690750). I began to question what I had after some Google searches pulled up other patches which showed disparities between what I'm using and what others are using.


Do you folks know of a good reference for properly calculating these GPMC settings?  When I did the EMIF4 stuff, I was very pleased to find TI had a great set of resources online (http://processors.wiki.ti.com/index.php/Setting_up_AM35x_SDRC_registers), including an Excel spreadsheet that helped generate the register values for me.  I've found the following links on the GPMC, but have not been nearly as impressed as I was with the information on the EMIF4:
http://processors.wiki.ti.com/index.php/AM3517/05_GPMC_Subsystem

http://processors.wiki.ti.com/index.php/Ethernet_Connectivity_via_GPMC

http://e2e.ti.com/support/dsp/sitara_arm174_microprocessors/f/416/t/49394.aspx


The only other thing I know I have on the GPMC is the NAND flash, and it is taking the default settings along with several other boards in the AM35xx family (<x-loader>/include/asm/arch-omap3/mem.h):
/* for L3 = 165MHz & 16-bit NAND */
# define M_NAND_GPMC_CONFIG1 0x00001800
# define M_NAND_GPMC_CONFIG2 0x00080800
# define M_NAND_GPMC_CONFIG3 0x00080800
# define M_NAND_GPMC_CONFIG4 0x06000600
# define M_NAND_GPMC_CONFIG5 0x00070808
# define M_NAND_GPMC_CONFIG6 0x000003cf
# define M_NAND_GPMC_CONFIG7 0x00000848

Any thoughts on that?
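
For reference, here is a rough sketch of how the CONFIG7 value above decodes. It assumes the OMAP3-family register layout as I understand it (BASEADDRESS in bits 5:0, CSVALID in bit 6, MASKADDRESS in bits 11:8, both address fields in 16 MB units). The struct and function names are made up for illustration, and the layout should be checked against the AM35xx TRM.

```c
#include <stdint.h>

/* Hypothetical container for the decoded chip-select window. */
struct gpmc_cs_window {
    int      valid;   /* CSVALID, bit 6 (assumed layout) */
    uint32_t base;    /* chip-select base address in bytes */
    uint32_t size;    /* chip-select window size in bytes */
};

struct gpmc_cs_window decode_config7(uint32_t config7)
{
    struct gpmc_cs_window w;
    uint32_t mask = (config7 >> 8) & 0xfu;   /* MASKADDRESS, bits 11:8 */

    w.valid = (int)((config7 >> 6) & 1u);
    w.base  = (config7 & 0x3fu) << 24;       /* BASEADDRESS, 16 MB units */
    w.size  = (1u << 28) - (mask << 24);     /* 256 MB minus masked chunks */
    return w;
}
```

Under those assumptions, 0x00000848 is a valid chip select based at 0x08000000 with a 128 MB window.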

Thanks again.  Your help is *greatly* appreciated.




^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Please help! AM35xx mm/slab.c BUG
  2012-06-06  7:51           ` CF Adad
@ 2012-06-06  8:41             ` Tony Lindgren
  2012-06-06 10:37               ` Jarkko Nikula
  2012-06-07  9:32             ` Mohammed, Afzal
  1 sibling, 1 reply; 20+ messages in thread
From: Tony Lindgren @ 2012-06-06  8:41 UTC (permalink / raw)
  To: CF Adad; +Cc: Shilimkar, Santosh, linux-omap

* CF Adad <cfadad@rocketmail.com> [120606 00:55]:
> 
> Do you folks know of a good reference for properly calculating these GPMC settings?

In theory you just need to know the timings of connected components,
then check which ones depend on cycles and which ones depend on time.

Also take into account latencies added by level shifters if you have those.
Paul Walmsley noticed a few years ago that those affected the smsc911x
timings if not accounted for.

Probably the closest thing for reference we have are the various
arch/arm/mach-omap2/gpmc*.c files. Then of course verifying the timings
using a scope might help a lot.
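
As a rough illustration of the time-versus-cycles point, the sketch below converts a datasheet time in ns into whole GPMC_FCLK ticks, rounding up. The helper names are made up, and the 83 MHz clock used in the note after the code is only an assumed example. The period is truncated on purpose, which errs toward more ticks rather than fewer.

```c
#include <stdint.h>

/* Period of one GPMC_FCLK tick in picoseconds for a clock given in kHz.
 * Truncating makes the tick look slightly shorter than it is, so the
 * conversion below never under-counts ticks. */
uint32_t fclk_period_ps(uint32_t fclk_khz)
{
    return 1000000000u / fclk_khz;
}

/* Smallest number of whole ticks that covers time_ns. */
uint32_t ns_to_ticks(uint32_t time_ns, uint32_t fclk_khz)
{
    uint32_t tick_ps = fclk_period_ps(fclk_khz);

    return (time_ns * 1000u + tick_ps - 1) / tick_ps;
}

/* Time actually programmed after rounding to whole ticks, in ps. */
uint32_t ticks_to_ps(uint32_t ticks, uint32_t fclk_khz)
{
    return ticks * fclk_period_ps(fclk_khz);
}
```

With an assumed 83 MHz GPMC_FCLK, a 45 ns requirement becomes 4 ticks, or about 48 ns on the bus. Parameters specified in cycles need no such conversion, which is why the two kinds have to be sorted out first.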

Regards,

Tony

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Please help! AM35xx mm/slab.c BUG
  2012-06-06  8:41             ` Tony Lindgren
@ 2012-06-06 10:37               ` Jarkko Nikula
  2012-06-06 15:53                 ` CF Adad
  0 siblings, 1 reply; 20+ messages in thread
From: Jarkko Nikula @ 2012-06-06 10:37 UTC (permalink / raw)
  To: Tony Lindgren; +Cc: CF Adad, Shilimkar, Santosh, linux-omap

my 2 cents.

On 06/06/2012 11:41 AM, Tony Lindgren wrote:
> * CF Adad <cfadad@rocketmail.com> [120606 00:55]:
>>
>> Do you folks know of a good reference for properly calculating these GPMC settings?
> 
> In theory you just need to know the timings of connected components,
> then check which ones depend on cycles and which ones depend on time.
> 
I'm afraid a paper-and-pencil GPMC exercise is often required, but after
that it is easier to see from the charts whether e.g. the original settings
were not optimal or too near the edge. It also helps to understand and
pinpoint possible problems in oscilloscope measurements.

> Also take into account latencies added by level shifters if you have those.
> Paul Walmsley noticed a few years ago that those affected the smsc911x
> timings if not accounted for.
> 
I've noticed the same. Even one-directional level shifters easily add a
few ns, and double that amount in a read operation, since then there are
two level shifters in the path: one on the clk/cs/oe/etc. cpu-to-chip
signal and one on the chip-to-cpu side.

Also pay attention to whether there are extra latencies in the chip. Chip
memory reads/writes may be slower than chip register accesses (probably
similar to the smsc fifo issue that Tony mentioned earlier in this thread).
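
A small sketch of the read-path bookkeeping described above, assuming one shifter on the outgoing control signal and one on the returning data. The function names and the numbers in the note are hypothetical; real values come from the datasheets and the scope.

```c
#include <stdint.h>

/*
 * Worst-case delay from the CPU asserting a control signal to valid read
 * data arriving back at the CPU: the control signal crosses one shifter
 * on the way out and the data crosses one on the way back.
 */
uint32_t read_path_ns(uint32_t chip_access_ns, uint32_t shifter_ns)
{
    return chip_access_ns + 2u * shifter_ns;
}

/* Slack between the programmed access time and the real read path.
 * A negative result means the data is sampled too early. */
int32_t read_margin_ns(uint32_t programmed_ns, uint32_t chip_access_ns,
                       uint32_t shifter_ns)
{
    return (int32_t)programmed_ns -
           (int32_t)read_path_ns(chip_access_ns, shifter_ns);
}
```

For example, a chip with a 30 ns access time behind 4 ns shifters needs 38 ns, so a 36 ns setting that looks fine on paper without the shifters is actually 2 ns short.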

-- 
Jarkko

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Please help! AM35xx mm/slab.c BUG
  2012-06-06 10:37               ` Jarkko Nikula
@ 2012-06-06 15:53                 ` CF Adad
  0 siblings, 0 replies; 20+ messages in thread
From: CF Adad @ 2012-06-06 15:53 UTC (permalink / raw)
  To: Jarkko Nikula, Tony Lindgren; +Cc: Shilimkar, Santosh, linux-omap

All,

Thanks for all your input and time.  We'll try rolling our sleeves up and looking at these GPMC timings again.  Looking through those links I shared previously, it looks like the AM35xx GPMC may be set up a bit differently than the OMAP or DM.  What items should I be taking into account as we try to draft this up ourselves?  For instance, from (http://e2e.ti.com/support/dsp/sitara_arm174_microprocessors/f/416/t/49394.aspx), it appears that my L3 clock should be set to 166MHz and GPMC_FCLK should be 83MHz.  My source is new enough that it now includes the various patches to enable the MPU to run at 600MHz instead of 500MHz.  (This patch is part of that:  http://lists.denx.de/pipermail/u-boot/2012-February/117018.html.  There were more, but I couldn't find them quickly when pulling this email together.)  Does that adjust anything here with respect to these GPMC calculations, or is the L3 held steady so it does not matter?


After all that crashing yesterday and the day before, we set up some additional tests, and now it seems we can't break it...  This is precisely why we've had to keep hunting for this thing for so long.  It rears its ugly head and crashes us a ton, only to disappear back into the vapor.

Thanks again for all your help and time.  If we get any more points of interest I'll certainly share.




^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: Please help! AM35xx mm/slab.c BUG
  2012-06-06  7:51           ` CF Adad
  2012-06-06  8:41             ` Tony Lindgren
@ 2012-06-07  9:32             ` Mohammed, Afzal
  2012-06-07 19:50               ` CF Adad
  1 sibling, 1 reply; 20+ messages in thread
From: Mohammed, Afzal @ 2012-06-07  9:32 UTC (permalink / raw)
  To: CF Adad; +Cc: linux-omap, Tony Lindgren, Shilimkar, Santosh

Hi "",

On Wed, Jun 06, 2012 at 13:21:23, CF Adad wrote:

> Thanks again.  I'm really starting to think the GPMC almost has to be contributing.

Does adding a cycle2cycle delay / bus turnaround prevent the issue?

The SMSC datasheet mentions special restrictions on back-to-back reads
and on write-then-read sequences. Reading BYTE_TEST should take care of
it, but I am not sure whether the driver covers all scenarios in the
datasheet.

Perhaps cycle2cycledelay would help us achieve it if the driver doesn't
take care of it.
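
To make the BYTE_TEST idea concrete, here is a minimal sketch of a write followed by dummy reads. The 0x64 offset for BYTE_TEST matches the smsc911x driver headers as far as I recall; the accessor names are hypothetical, and how many dummy reads a given register needs has to come from the datasheet, so it is left to the caller.

```c
#include <stdint.h>

#define BYTE_TEST 0x64u  /* read-only test register offset (assumed) */

/* Hypothetical accessor; regs points at the mapped chip-select window. */
static inline uint32_t reg_read(volatile uint32_t *regs, uint32_t off)
{
    return regs[off / 4];
}

/*
 * Write a register, then burn bus cycles with dummy BYTE_TEST reads so
 * that a following read of the written register is spaced far enough
 * from the write. dummy_reads is caller-supplied, not a datasheet value.
 */
void reg_write_then_settle(volatile uint32_t *regs, uint32_t off,
                           uint32_t val, unsigned dummy_reads)
{
    regs[off / 4] = val;
    while (dummy_reads--)
        (void)reg_read(regs, BYTE_TEST);
}
```

A GPMC cycle2cycle delay would insert a similar gap in hardware for every access on that chip select, instead of relying on the driver to add reads in the right places.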

Regards
Afzal

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Please help! AM35xx mm/slab.c BUG
  2012-06-07  9:32             ` Mohammed, Afzal
@ 2012-06-07 19:50               ` CF Adad
  2012-06-12 11:14                 ` Mohammed, Afzal
  0 siblings, 1 reply; 20+ messages in thread
From: CF Adad @ 2012-06-07 19:50 UTC (permalink / raw)
  To: Mohammed, Afzal; +Cc: linux-omap, Tony Lindgren, Shilimkar, Santosh

Hi Afzal,

Thanks for the suggestion.  This is something I was actually looking at just the other day.  I contemplated changing it, but have not done so yet.  I left it alone because I believe the folks that did all the core OMAP35x and LAN9221 timing and driver work knew way more than I do about all this.  Now, if there is a difference between the way the OMAP35x is clocked or set up and the way my AM35x is clocked or set up, then I would expect these values to change.  So far, however, I've not seen a difference, and I have not seen anyone else trying to wire up a 9221 to an AM35xx.  This confuses me.  Is the AM35xx not capable of supporting a 9221's timings for some reason?  If it can support it, why would it seem everyone is still using the slower 9220?  Even the TI reference materials suggest the LAN9220 (http://processors.wiki.ti.com/index.php/Ethernet_Connectivity_via_GPMC).


I've had a thread (http://e2e.ti.com/support/dsp/sitara_arm174_microprocessors/f/416/p/172157/690750.aspx) sitting over at the TI E2E forums for a while now on this subject of GPMC, OMAP35x vs. AM35x, and the LAN9221.  So far, I've not received any feedback.  :-(

Thanks for your thoughts!  I may try fiddling a bit just to see if that helps.




^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: Please help! AM35xx mm/slab.c BUG
  2012-06-07 19:50               ` CF Adad
@ 2012-06-12 11:14                 ` Mohammed, Afzal
  2012-06-12 15:27                   ` CF Adad
                                     ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Mohammed, Afzal @ 2012-06-12 11:14 UTC (permalink / raw)
  To: CF Adad; +Cc: linux-omap, Tony Lindgren, Shilimkar, Santosh

Hi,

On Fri, Jun 08, 2012 at 01:20:13, CF Adad wrote:

> Thanks for your thoughts!  I may try fiddling a bit just to see if that helps.

Five patch series with GPMC modifications [1-5] have been posted.
If they help in resolving your issue, please let me know.

You will have to use the new interface to make use of the runtime
calculation of smsc911x timings; refer to [5.x] for how to do the
board modifications for smsc911x. Please comment out any other
GPMC peripheral initialization in your board code. It seems you have
NAND, so comment out the NAND initialization; NAND can also be made to
work with the new changes, but avoiding it will probably reduce your
burden.

As your eth is a 9221, you can either provide timings based on it
from the board file or apply [6] over [1-5] (base: 3.5-rc1).

Regards
Afzal

[1] http://www.mail-archive.com/linux-omap@vger.kernel.org/msg69501.html
[2] http://www.mail-archive.com/linux-omap@vger.kernel.org/msg69881.html
[3] http://www.mail-archive.com/linux-omap@vger.kernel.org/msg69891.html
[4] http://www.mail-archive.com/linux-omap@vger.kernel.org/msg69897.html
[5] http://www.mail-archive.com/linux-omap@vger.kernel.org/msg69917.html
[5.x] http://www.mail-archive.com/linux-omap@vger.kernel.org/msg69924.html

[6]
diff --git a/arch/arm/mach-omap2/gpmc-smsc911x.c b/arch/arm/mach-omap2/gpmc-smsc911x.c
index 4bfe721..34816b9 100644
--- a/arch/arm/mach-omap2/gpmc-smsc911x.c
+++ b/arch/arm/mach-omap2/gpmc-smsc911x.c
@@ -105,12 +105,12 @@ static void gpmc_smsc911x_timing(struct gpmc_time_ctrl *time_ctrl,
 {
        struct gpmc_timings t;
        /* SMSC 9220 timings */
-       unsigned tcycle_r = 165;
+       unsigned tcycle_r = 45;
        unsigned tcsl_r = 32;
        unsigned tcsh_r = 133;
        unsigned tcsdv_r = 30;
        unsigned tdoff_r = 9;
-       unsigned tcycle_w = 165;
+       unsigned tcycle_w = 45;
        unsigned tcsl_w = 32;
        unsigned tcsh_w = 133;
        unsigned tdsu_w = 7;



* Re: Please help! AM35xx mm/slab.c BUG
  2012-06-12 11:14                 ` Mohammed, Afzal
@ 2012-06-12 15:27                   ` CF Adad
  2012-06-14 17:28                   ` CF Adad
  2012-06-19  1:29                   ` CF Adad
  2 siblings, 0 replies; 20+ messages in thread
From: CF Adad @ 2012-06-12 15:27 UTC (permalink / raw)
  To: Mohammed, Afzal; +Cc: linux-omap, Tony Lindgren, Shilimkar, Santosh

Afzal,

Thanks!  This set looks very interesting.  As I've been chasing this bug all week, I haven't had time to sync our build up to the latest tree.  So I'm still back at 3.4-rc6 and cannot directly patch this in and give it a try yet.  However, I'll certainly take a look as soon as I can.  My research on the issue so far indicates that we may have a hardware issue rather than a software issue.  We're seeing some nasty transients coming in from our power supply, so we're chasing that first.

Thanks again for the post and suggestions though.  I do want to get up to the 3.5 series as soon as I can.

Best regards!




* Re: Please help! AM35xx mm/slab.c BUG
  2012-06-12 11:14                 ` Mohammed, Afzal
  2012-06-12 15:27                   ` CF Adad
@ 2012-06-14 17:28                   ` CF Adad
  2012-06-14 19:10                     ` jean-philippe francois
  2012-06-19  1:29                   ` CF Adad
  2 siblings, 1 reply; 20+ messages in thread
From: CF Adad @ 2012-06-14 17:28 UTC (permalink / raw)
  To: Mohammed, Afzal; +Cc: linux-omap, Tony Lindgren, Shilimkar, Santosh

An update:

*LAN9221 and GPMC off the hook?*

I believe we've isolated the GPMC from this by disabling the LAN9221 in both our bootloaders and the kernel and by booting everything off the SD cards only.  By removing its initialization code from the respective board files, I *hope* we've basically removed it from contention.  Obviously the chip is still wired up, but I don't expect the bootloaders or kernel to be trying to talk to it.  Likewise, the NAND is being initialized, but we're not mounting or using it at all.

With these changes we're still seeing these crashes, albeit just as infrequently as before.

*EMAC now _partially_ on the hook?*
I posted a separate thread on what I think may be a related subject, potential Davinci EMAC problems, here:  http://www.spinics.net/lists/linux-omap/msg71833.html.

As you can see from the crashes posted there, there seems to be a bit of whining from the EMAC driver.  Since performance in the EMAC <=> EMAC case has always been questionable anyway, any chance there is a tiny memory leak or something similar that could be contributing?

What about configuring this EMAC from within u-boot?  Could that initialization do something bad when we get into Linux?  I've not touched these drivers.  I've simply called them like other boards in the family are doing.

Just this morning, I upgraded to the latest linux-omap 3.5-rc2, but still saw one of these crashes pretty quickly...

*Power stability?*

We're learning through all of this that our boards do appear to have some funny transients running through the power circuits every so often.  The ones we've captured on the scope have not caused crashes or hard lockups, but they are there.  This could be a dumb question, but could power issues create a slab error like this?  I guess I'm more accustomed to seeing power issues result in hard lockups rather than a nicely worded dump with the kernel sometimes still somewhat functioning.


Can anyone suggest to me anything I may not have tried to get more information out of these crashes when they occur?

Thanks again to all!


* Re: Please help! AM35xx mm/slab.c BUG
  2012-06-14 17:28                   ` CF Adad
@ 2012-06-14 19:10                     ` jean-philippe francois
  2012-06-15  4:23                       ` CF Adad
  0 siblings, 1 reply; 20+ messages in thread
From: jean-philippe francois @ 2012-06-14 19:10 UTC (permalink / raw)
  To: CF Adad; +Cc: Mohammed, Afzal, linux-omap, Tony Lindgren, Shilimkar, Santosh

2012/6/14 CF Adad <cfadad@rocketmail.com>:
> An update:
>
> *LAN9221 and GPMC off the hook?*
>
> I believe we've isolated the GPMC from this by disabling the LAN9221 in both our bootloaders and the kernel and by booting everything off the SD cards only.  By removing its initialization code from the respective board files, I *hope* we've basically removed it from contention.  Obviously the chip is still wired up, but I don't expect the bootloaders or kernel to be trying to talk to it.  Likewise, the NAND is being initialized, but we're not mounting or using it at all.
>
> With these changes we're still seeing these crashes, albeit just as infrequently as before.
>
> *EMAC now _partially_ on the hook?*
> I posted a separate thread on what I think may be a related subject, potential Davinci EMAC problems, here:  http://www.spinics.net/lists/linux-omap/msg71833.html.
>
> As you can see from the crashes posted there, there seems to be a bit of whining from the EMAC driver.  Since performance in the EMAC <=> EMAC case has always been questionable anyway, any chance there is a tiny memory leak or something similar that could be contributing?
>
> What about configuring this EMAC from within u-boot?  Could that initialization do something bad when we get into Linux?  I've not touched these drivers.  I've simply called them like other boards in the family are doing.
>
> Just this morning, I upgraded to the latest linux-omap 3.5-rc2, but still saw one of these crashes pretty quickly...
>
> *Power stability?*
>
> We're learning through all of this that our boards do appear to have some funny transients running through the power circuits every so often.  The ones we've captured on the scope have not caused crashes or hard lockups, but they are there.  This could be a dumb question, but could power issues create a slab error like this?  I guess I'm more accustomed to seeing power issues result in hard lockups rather than a nicely worded dump with the kernel sometimes still somewhat functioning.
>
>
I am following this bug with interest, because we often go the "custom
hardware" way and have faced situations like this.
In my opinion, random memory corruption is more often than not the sign
of a hardware design issue. EMAC here is perhaps only a symptom, because
it provides just the memory bandwidth and power consumption pattern that
triggers the glitch.

How many boards do you have? Are some more stable than others?
Can you solder additional caps on top of your power decoupling caps?
Can you tweak the voltages?

> Can anyone suggest to me anything I may not have tried to get more information out of these crashes when they occur?
>
> Thanks again to all!
>

* Re: Please help! AM35xx mm/slab.c BUG
  2012-06-14 19:10                     ` jean-philippe francois
@ 2012-06-15  4:23                       ` CF Adad
  0 siblings, 0 replies; 20+ messages in thread
From: CF Adad @ 2012-06-15  4:23 UTC (permalink / raw)
  To: jean-philippe francois
  Cc: Mohammed, Afzal, linux-omap, Tony Lindgren, Shilimkar, Santosh

Hi Jean-Philippe,

Thanks for the notes. I agree whole-heartedly with the idea that we're having a hardware issue. I'm just trying to eliminate any other software-related issues if I can. Since I'm really just a software engineer, I have to leave the hardware work to the hardware folks. All I can do is try to make sure the software is as rock solid as I can. Frankly, however, I think we're running out of places to look in software, provided you folks do not know of any outstanding issues similar to ours in the core AM35xx support at the moment.

The *only* thing that seems to stand in contrast to the notion of this being a hardware problem is that we have *finally* managed to see a few of these crashes on our development kit. Since that kit contains none of our hardware, that was pretty surprising. That platform however, is certainly _exponentially_ more stable than ours at the moment, which does support the idea of this being mostly or perhaps completely a hardware issue.

Our design is using a COM developed by another vendor and a custom baseboard. As noted in another post, our power has been "glitchy" since our first prototype baseboards arrived. Our board uses a standard, PC-grade power supply that we selected from a very reputable brand due to its very low stated minimum loads. Obviously AM35xx processors, even with a load of attachments, do not demand much relative to a standard PC. We expected that power to be clean due to the reputation of the supply manufacturer and the fact we met the minimum specs. So we basically wired its 5V and 3.3V rails directly into our board and the AM3517-based COM with only customary decoupling, nothing special. The trouble has been that those vendor specs appear to be a bit "dreamy" to say the least. We've measured a number of variances in the supply, and have had to install some loading resistors on the rails for now just to ease an otherwise very nasty ripple. So we do have all that going on and expect that could be the core of this. More recently, we also started detecting high frequency transients that seem to show up on all the rails at 20 - 30 minute intervals or during heavy usage. These signals are in the 80 - 100MHz range, and exist for only a few pulses. Though the board appears to keep trucking when they occur, there's no way those can be good. We've yet to determine where those are coming from. So, all those things are being looked at very closely right now.

To answer your questions:  We do have a few boards in the hands of a few engineers, and all of them are seeing similar stability and performance issues, again very sporadically. Since they appear for a while and then disappear for a much longer while, it's been incredibly hard to characterize in any sense of the word. Regarding adding capacitance or other filtering, our hardware engineer is looking at that right now. As far as the voltages are concerned, I don't believe we have a lot of control over the 5V, 3.3V, etc. as they are basically just sourced directly from the supply.

Regarding the EMAC:  Has anyone else got a pair of AM3517-based somethings lying around that they may have run iperf between?  If directly connected with either a crossover cable or a decent 100Mbps switch, can anyone get the full 85+Mbps one would expect with good TCP on a 100Mbps link?  As noted in the other post, the best we can do between our EMACs is roughly 50 - 70Mbps. If we connect that same EMAC, through the same switch, to a non-EMAC NIC, we fly right up to the full 85+Mbps mark and stay there. It's incredibly odd... I'd be very interested in knowing what others are seeing.
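
For reference on the throughput figures above, a rough sketch of the theoretical ceiling for TCP payload on an Ethernet link. It assumes standard framing, IPv4 and TCP headers without options, and no retransmissions. The function name tcp_goodput_mbps is made up for illustration and is not part of iperf or any driver.

```c
#include <assert.h>

/*
 * Upper bound on TCP payload throughput for a given link rate.
 * Assumes 20-byte IPv4 and 20-byte TCP headers (no options) inside
 * each MTU-sized packet, plus per-frame Ethernet overhead on the wire.
 */
static double tcp_goodput_mbps(double link_mbps, unsigned mtu)
{
	/* preamble+SFD (8), Ethernet header (14), FCS (4), inter-frame gap (12) */
	const unsigned eth_overhead = 8 + 14 + 4 + 12;
	const unsigned ip_tcp_hdrs = 20 + 20;

	assert(mtu > ip_tcp_hdrs);

	return link_mbps * (double)(mtu - ip_tcp_hdrs)
			 / (double)(mtu + eth_overhead);
}
```

With a 1500-byte MTU on a 100Mbps link this works out to 1460/1538 of the line rate, a little under 95Mbps, so 85+Mbps measured with iperf is in the range a healthy link should reach, and 50 - 70Mbps falls well short of it.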

Thanks again for your comments!





* Re: Please help! AM35xx mm/slab.c BUG
  2012-06-12 11:14                 ` Mohammed, Afzal
  2012-06-12 15:27                   ` CF Adad
  2012-06-14 17:28                   ` CF Adad
@ 2012-06-19  1:29                   ` CF Adad
  2012-06-19  6:29                     ` Mohammed, Afzal
  2 siblings, 1 reply; 20+ messages in thread
From: CF Adad @ 2012-06-19  1:29 UTC (permalink / raw)
  To: Mohammed, Afzal; +Cc: linux-omap, Tony Lindgren, Shilimkar, Santosh

Hi Afzal,

I just wanted to follow back up.  We are still trying to find this slab bug, but so far we've not found any smoking guns.  Less than stable power is our current main suspect. As has been the custom however, the error was here 2 weeks ago nearly non-stop, and then as suddenly as it came it went nearly silent.  We've had a very hard time reproducing it since!


Anyway, we have advanced our kernel to today's latest l-o (3.5-rc2). Though we are not considering the GPMC a likely source of the error at this moment, I'm considering exploring this patchset.  Unfortunately, the NAND is very critical to our current efforts.  So I'm trying to find a time where it would be OK to disable it as you suggested to try this.  Since these values are being set now in Linux, do I need to rework my bootloaders as well?  In my current case, these settings are all done in u-boot. I do not believe Linux did anything with them.  Do I need to remove those in order to use your patches?  If I do, do I not lose access to those things while in the bootloaders?


Thanks




* RE: Please help! AM35xx mm/slab.c BUG
  2012-06-19  1:29                   ` CF Adad
@ 2012-06-19  6:29                     ` Mohammed, Afzal
  0 siblings, 0 replies; 20+ messages in thread
From: Mohammed, Afzal @ 2012-06-19  6:29 UTC (permalink / raw)
  To: CF Adad; +Cc: linux-omap, Tony Lindgren, Shilimkar, Santosh

Hi,

On Tue, Jun 19, 2012 at 06:59:42, CF Adad wrote:

> Anyway, we have advanced our kernel to today's latest l-o (3.5-rc2). Though we are not considering the GPMC a likely source of the error at this moment, I'm considering exploring this patchset.  Unfortunately, the NAND is very critical to our current efforts.  So I'm trying to find a time where it would be OK to disable it as you suggested to try this.  Since these values are being set now in Linux, do I need to rework my bootloaders as well?  In my current case, these settings are all done in u-boot. I do not believe Linux did anything with them.  Do I need to remove those in order to use your patches?  If I do, do I not lose access to those things while in the bootloaders?

Please send me your board file. I will modify it to use the
new interface, with NAND relying on the bootloader settings and
SMSC relying on the kernel.

No bootloader modifications are required.

Regards
Afzal
 


end of thread, other threads:[~2012-06-19  6:29 UTC | newest]

Thread overview: 20+ messages
2012-06-05  6:37 Please help! AM35xx mm/slab.c BUG CF Adad
2012-06-05  7:08 ` Tony Lindgren
2012-06-05 16:29   ` CF Adad
2012-06-06  6:14     ` CF Adad
2012-06-06  6:36       ` Shilimkar, Santosh
2012-06-06  7:08         ` CF Adad
2012-06-06  7:10         ` Tony Lindgren
2012-06-06  7:51           ` CF Adad
2012-06-06  8:41             ` Tony Lindgren
2012-06-06 10:37               ` Jarkko Nikula
2012-06-06 15:53                 ` CF Adad
2012-06-07  9:32             ` Mohammed, Afzal
2012-06-07 19:50               ` CF Adad
2012-06-12 11:14                 ` Mohammed, Afzal
2012-06-12 15:27                   ` CF Adad
2012-06-14 17:28                   ` CF Adad
2012-06-14 19:10                     ` jean-philippe francois
2012-06-15  4:23                       ` CF Adad
2012-06-19  1:29                   ` CF Adad
2012-06-19  6:29                     ` Mohammed, Afzal
