From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from gateway.prograde.net ([66.92.163.78] helo=sol.prograde.net) by canuck.infradead.org with esmtp (Exim 4.72 #1 (Red Hat Linux)) id 1Plnfv-0003Eh-9j for linux-mtd@lists.infradead.org; Sat, 05 Feb 2011 19:20:04 +0000 Subject: Re: Numonyx NOR and chip->mutex bug? Mime-Version: 1.0 (Apple Message framework v1082) Content-Type: text/plain; charset=us-ascii From: Michael Cashwell In-Reply-To: Date: Sat, 5 Feb 2011 14:19:59 -0500 Content-Transfer-Encoding: quoted-printable Message-Id: <76241A6C-ECA0-46B1-B8B4-F3786FCBA339@prograde.net> References: <16826B66-31FE-41AD-A6EF-E668A45AF1FE@prograde.net> <25631ED7-C6A0-44B1-B33D-F48DC48C812E@prograde.net> <626D0191-85FC-41E2-94C7-CBFF9D9629BE@prograde.net> <6FC0E416-EEBD-453F-AAB9-88BB6D90BFAB@prograde.net> <4D4AD9ED.8060104@keymile.com> <4D4B37D4.4050204@keymile.com> <4D4BDD48.6040600@keymile.com> <541E19B8-D428-4F59-B6BB-A3BD8F455AE4@prograde.net> To: Joakim Tjernlund Cc: Holger brunck , linux-mtd@lists.infradead.org, stefan.bigler@keymile.com List-Id: Linux MTD discussion mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , On Feb 5, 2011, at 5:29 AM, Joakim Tjernlund wrote: > Michael Cashwell wrote on 2011/02/04 18:09:01: >>=20 >> That's where I get confused. It's true that another thread could do = that but wouldn't it have to bracket its operation with calls to = get_chip() and put_chip()? Those actually do the suspend and resume. I'm = certainly seeing erases being suspended to do a buffered write and then = being resumed. >=20 > buffered write do get_chip that suspends the erase, then a = WAIT_TIMEOUT which drops the locks and gives the suspended erase a = chance to wakeup and run but now SR is holding the status for the write = so the suspended erase are testing the wrong status, do you agree? It's not that simple. Just waking the suspended erase thread should not = cause it to "run" the way you are saying. That thread *should* wake up in inval_cache_and_wait_for_operation() = inside its while (chip->state !=3D chip_state) loop. The purpose of that = loop is to see if something else is going on with the chip and if so = just go back to sleep. That should prevent what you are saying is = happening and if it is not then we need to figure out why. In the case of a buffered write the erase thread should see chip->state = as either FL_WRITING_TO_BUFFER (if after the 0xE8) or FL_WRITING (if = after the 0xD0). Since those don't match the expected FL_ERASING state = the thread should stay in the while loop and go back to sleep on the = wait queue. It should keep doing that until the chip->state has come = back to FL_ERASING. That should not happen until the write function is = finished and does its put_chip and erase resume. So what I'm saying is that just dropping the locks and letting other = threads run is not, by itself itself, wrong. The code's supposed to do = that. It's just that the other threads are supposed check the conditions = when they wake up and only proceed if the conditions are correct. But I do see something suspicious. I think it can be argued that all CFI = thread wake-ups need to do the while (chip->state !=3D chip_state) dance = and go back to sleep if not. That seems to be the case for inval_cache_and_wait_for_operation() but = what about the other places where threads can be scheduled out? Won't = they just wake up where they are or am I missing some wait queue magic? IOW: We should look for other places outside of = inval_cache_and_wait_for_operation() that call things like msleep, = schedule, and cond_resched. (One sneaky one is cfi_udelay as it may call = cond_resched.) If any of those fail to do the chip->state !=3D chip_state sleep loop = when they awaken then we have have found our problem. One last idea is to add more info in your logging code: ubimkvol /dev/ubi0 -s 6MiB -N test1 [2299][209] do_erase_oneblock start adr=3D0x00020000 len=3D0x20000 [2299][209] map_write 0x50 to 0x00020000 [2299][209] map_write 0x20 to 0x00020000 [2299][209] map_write 0xd0 to 0x00020000 [2299][465] map_write 0xb0 to 0x03fe5000 [2307][465] erase suspend 1 adr=3D0x03fe5000 [2307][465] map_write 0x70 to 0x03fe5000 [2307][465] map_write 0xe8 to 0x03fe5000 [2307][209] map_write 0x70 to 0x00020000 [2307][209] map_write 0x50 to 0x00020000 [2307][209] map_write 0xd0 to 0x00020000 [2307][209] map_write 0x70 to 0x00020000 [2311][209] erase resumed 2b adr=3D0x00020000 [2319][209] do_erase_oneblock end adr=3D0x00020000 len=3D0x20000 [2319][465] map_write 0x1ff to 0x03fe5000 [2319][465] map_write 0xc03c0000 to 0x03fe5000 [2319][465] map_write 0xc03c0000 to 0x03fe5002 ... [2327][465] map_write 0xc03c0000 to 0x03fe53fc [2327][465] map_write 0xc03c0000 to 0x03fe53fe [2327][465] map_write 0xd0 to 0x03fe5000 [3387][465] map_write 0x50 to 0x03fe5000 [3387][465] map_write 0x70 to 0x03fe5000 [3395][465] 50000000.flash: buffer write error status 0xb0 = adr=3D0x03fe5400 len=3D0x0 (0x400) when "erase resumed" happens, it should be inside the put_chip() done = when the write command is finished. Something to confirm. It would be helpful if the log also showed start/end for = do_buffer_write() and would print chip->state at various points since = that value controls what threads do when they wake up. > I won't worry about this until you test my patch and still have a = problem :) > Please test it. I tested it quickly on Friday and it didn't fix my problem, but we may = be fighting different problems. I also can't be sure the .c file didn't have other changes. I was really = pressed for time that day. My plan is to retest your patch against a pristine copy of the file and = report back. Unfortunately this has to wait until Monday as I don't have = access to the hardware over the weekend. -Mike