On Tue, 16 Mar 2010, Carlos O'Donell wrote:

> After some of my own testing I think this is all MMU related, but I
> can't prove it yet. I'm pouring through as much kernel code as I can
> right now to determine what is going wrong at the time of the clone,
> and I see at least one bug that I'm investigating regarding return
> addresses.

I've attached another version of the minifail test program.  In this
one, the parent and thread both monitor the location 0x4000001c in the
stack region allocated to the thread.  If a problem is detected, they
drop core with an illegal instruction.  If the child of the fork sees
a nonzero value in the above location when the fork call returns, it
sleeps for ten seconds.

When corruption occurs and core is dropped on my c3750 (UP 32-bit
kernel), both the parent and thread have undergone many iterations of
their respective monitor loops.  The forked child always reports seeing
a nonzero value at the stack location.  The before value in the core
dump was zero (i.e., thread_run had not started).

I added an illegal instruction abort to the child.  In this case, the
thread_run loop counter was 48085 when the page was copied and the
before value was zero.

One thought that has crossed my mind is that the memory pages allocated
for the stack region used by the thread are somehow getting interchanged
between parent and child by the fork operation.  This happens fairly
late, as both the parent and thread are already executing post fork
when it occurs.  Possibly, this is part of the bug.

I have looked at entry.S and pacache.S quite a bit and it's not obvious
how this could happen, although I must admit to not fully understanding
the tmp alias code.  I tend to think the bug is in the core mm code.

I see a few cleanups to entry.S.  We didn't kill the misnamed macros
(DEP, DEPI and EXTR), for example.  But I don't think these are the
problem.

Dave
-- 
J. David Anglin                                  dave.anglin@nrc-cnrc.gc.ca
National Research Council of Canada              (613) 990-0752 (FAX: 952-6602)