linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* alloc_area_pte: page already exists
@ 2001-08-09 13:32 Bjorn Wesen
  2001-08-09 16:03 ` Andrea Arcangeli
  0 siblings, 1 reply; 10+ messages in thread
From: Bjorn Wesen @ 2001-08-09 13:32 UTC (permalink / raw)
  To: linux-kernel

I'm trying to track down a problem which seems to be a race condition
somewhere, involving a driver using kiobuf's (on Linux 2.4). The driver
does the usual stuff like this

	if((ret = alloc_kiovec(1, &myreqbuf)))
		goto out;
	
	if((ret = map_user_kiobuf(READ, myreqbuf,
				  req_u,
				  sizeof(struct my_request)))) {
		free_kiovec(1, &myreqbuf);
		goto out;
	}

and it works 9999 out of 10000 times but sometimes alloc_kiovec fails
inside its child calls (vmalloc -> alloc_area_pte) with

alloc_area_pte: page already exists

that is, for some reason the master page table (init_mm's) becomes
unsynced with the vmalloc lists so vmalloc tries to insert into a position
where something already is mapped.

I was just wondering if someone here knows a typical way this
desyncing could arise (in the style of "this could be a race in the
vmalloc page table delayed PTE copying", or "you must never
call free_kiovec in an interrupt context" etc..)

I'm not saying it's a standard kernel bug, it most probably is a bug in
the driver I'm writing or in our Linux port (arch/cris) but maybe someone
has seen this before and knows what could be the cause. 

thanks,
Bjorn








^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: alloc_area_pte: page already exists
  2001-08-09 16:03 ` Andrea Arcangeli
@ 2001-08-09 15:12   ` Bjorn Wesen
  2001-08-09 16:36     ` Andrea Arcangeli
  0 siblings, 1 reply; 10+ messages in thread
From: Bjorn Wesen @ 2001-08-09 15:12 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel

On Thu, 9 Aug 2001, Andrea Arcangeli wrote:
> > I'm trying to track down a problem which seems to be a race condition
> > vmalloc page table delayed PTE copying", or "you must never
> > call free_kiovec in an interrupt context" etc..)
> 
> The latter is certainly the case. The former could be the case, however

Had some nasty bugs in that case before.. 

> if you know you're running any free_kiovec from irqs remove it and try
> again first.

Yes, I found it. The irq which signals end-of-I/O also frees the kiovec,
which is bad. But it seemed that what would happen then would be deadlocks
(due to the spinlocks) and not races - but I'll try to make it free in a
task or something instead (there is no kernel context for the "calling"
process because the driver is asynchronous so the context which did the
alloc_kiovec etc. has exited when the irq comes later) and see if it works
better.

> BTW, you should also avoid all the kiobuf allocation in all the fast
> paths, try to pre-allocate it in a slow path to deliver higher
> performance. (shortly we'll split the bh/blocks array out of the kiobuf

Ok, that's probably not a problem since the allocation is done in a normal
system-call context.

/Bjorn


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: alloc_area_pte: page already exists
  2001-08-09 16:36     ` Andrea Arcangeli
@ 2001-08-09 15:33       ` Bjorn Wesen
  2001-08-09 17:04         ` Andrea Arcangeli
  0 siblings, 1 reply; 10+ messages in thread
From: Bjorn Wesen @ 2001-08-09 15:33 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel

On Thu, 9 Aug 2001, Andrea Arcangeli wrote:
> > task or something instead (there is no kernel context for the "calling"
> > process because the driver is asynchronous so the context which did the
> > alloc_kiovec etc. has exited when the irq comes later) and see if it works
> > better.
> 
> Ok.

I realised I'm not entirely sure on if it's ok to do such "dangerous"
functions inside, say, tq_immediate using queue_task even ? Doesn't that
run in the interrupt context also, upon exit of the interrupt before going
back ?

I haven't followed the latest bh/tasklet/softirq development really..

IOW I want the irq to "trigger" the freeing of the iovecs but it's ok if
it's done later, as long as it's done without any races :)

BTW how does vfree cope with not walking all tasks pgd's to remove the
relevant pte's ? Doesn't that give exactly this kind of problem ? (pte's
are there despite them being deallocated)

/Bjorn



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: alloc_area_pte: page already exists
  2001-08-09 13:32 alloc_area_pte: page already exists Bjorn Wesen
@ 2001-08-09 16:03 ` Andrea Arcangeli
  2001-08-09 15:12   ` Bjorn Wesen
  0 siblings, 1 reply; 10+ messages in thread
From: Andrea Arcangeli @ 2001-08-09 16:03 UTC (permalink / raw)
  To: Bjorn Wesen; +Cc: linux-kernel

On Thu, Aug 09, 2001 at 03:32:44PM +0200, Bjorn Wesen wrote:
> I'm trying to track down a problem which seems to be a race condition
> vmalloc page table delayed PTE copying", or "you must never
> call free_kiovec in an interrupt context" etc..)

The latter is certainly the case. The former could be the case, however
if you know you're running any free_kiovec from irqs remove it and try
again first.

BTW, you should also avoid all the kiobuf allocation in all the fast
paths, try to pre-allocate it in a slow path to deliver higher
performance. (shortly we'll split the bh/blocks array out of the kiobuf
in a data structure called kiobuf_io so the non-IO usage will avoid the
overhead of the bh/blocks arrays and it will become lighter but still
it won't be that light)

Andrea

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: alloc_area_pte: page already exists
  2001-08-09 15:12   ` Bjorn Wesen
@ 2001-08-09 16:36     ` Andrea Arcangeli
  2001-08-09 15:33       ` Bjorn Wesen
  0 siblings, 1 reply; 10+ messages in thread
From: Andrea Arcangeli @ 2001-08-09 16:36 UTC (permalink / raw)
  To: Bjorn Wesen; +Cc: linux-kernel

On Thu, Aug 09, 2001 at 05:12:43PM +0200, Bjorn Wesen wrote:
> Yes, I found it. The irq which signals end-of-I/O also frees the kiovec,
> which is bad. But it seemed that what would happen then would be deadlocks
> (due to the spinlocks) and not races - but I'll try to make it free in a

In SMP the most obvious sympthom would be a deadlock right, but maybe
you're testing with a kernel compiled UP?

The locking there looks quite sane.

> task or something instead (there is no kernel context for the "calling"
> process because the driver is asynchronous so the context which did the
> alloc_kiovec etc. has exited when the irq comes later) and see if it works
> better.

Ok.

> > BTW, you should also avoid all the kiobuf allocation in all the fast
> > paths, try to pre-allocate it in a slow path to deliver higher
> > performance. (shortly we'll split the bh/blocks array out of the kiobuf
> 
> Ok, that's probably not a problem since the allocation is done in a normal
> system-call context.

Ok.

Andrea

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: alloc_area_pte: page already exists
  2001-08-09 15:33       ` Bjorn Wesen
@ 2001-08-09 17:04         ` Andrea Arcangeli
  2001-08-09 21:50           ` Bjorn Wesen
  2001-08-12 12:16           ` Bjorn Wesen
  0 siblings, 2 replies; 10+ messages in thread
From: Andrea Arcangeli @ 2001-08-09 17:04 UTC (permalink / raw)
  To: Bjorn Wesen; +Cc: linux-kernel

On Thu, Aug 09, 2001 at 05:33:27PM +0200, Bjorn Wesen wrote:
> I realised I'm not entirely sure on if it's ok to do such "dangerous"
> functions inside, say, tq_immediate using queue_task even ? Doesn't that
> run in the interrupt context also, upon exit of the interrupt before going
> back ?

Yes it does, also you should use tasklets, never tq_immediate, these days
(tasklets can run in parallel and they're serialized only against
themself).

> IOW I want the irq to "trigger" the freeing of the iovecs but it's ok if
> it's done later, as long as it's done without any races :)

Your design looks suspect, but you can do that safely (at least as far
as vfree is concerned) with keventd's schedule_task().

> BTW how does vfree cope with not walking all tasks pgd's to remove the
> relevant pte's ? Doesn't that give exactly this kind of problem ? (pte's

vfree as usual walks the pgd/pmd to reach the pte. It knows the
pgd/pmd/pte cannot go away and it serlializes against vmalloc with the
vmlist_lock, it sounds ok.

Andrea

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: alloc_area_pte: page already exists
  2001-08-09 17:04         ` Andrea Arcangeli
@ 2001-08-09 21:50           ` Bjorn Wesen
  2001-08-12  9:44             ` Bjorn Wesen
  2001-08-12 12:16           ` Bjorn Wesen
  1 sibling, 1 reply; 10+ messages in thread
From: Bjorn Wesen @ 2001-08-09 21:50 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel

On Thu, 9 Aug 2001, Andrea Arcangeli wrote:
> > IOW I want the irq to "trigger" the freeing of the iovecs but it's ok if
> > it's done later, as long as it's done without any races :)
> 
> Your design looks suspect, but you can do that safely (at least as far
> as vfree is concerned) with keventd's schedule_task().

Ok will try

The design is shit but needed in this application... 

> > BTW how does vfree cope with not walking all tasks pgd's to remove the
> > relevant pte's ? Doesn't that give exactly this kind of problem ? (pte's
> 
> vfree as usual walks the pgd/pmd to reach the pte. It knows the
> pgd/pmd/pte cannot go away and it serlializes against vmalloc with the
> vmlist_lock, it sounds ok.

So what happens when the kernel accesses the non-existant pte's or when
the vmalloc space runs out ?

/Bjorn


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: alloc_area_pte: page already exists
  2001-08-09 21:50           ` Bjorn Wesen
@ 2001-08-12  9:44             ` Bjorn Wesen
  0 siblings, 0 replies; 10+ messages in thread
From: Bjorn Wesen @ 2001-08-12  9:44 UTC (permalink / raw)
  To: linux-kernel

On Thu, 9 Aug 2001, Bjorn Wesen wrote:
> > vfree as usual walks the pgd/pmd to reach the pte. It knows the
> > pgd/pmd/pte cannot go away and it serlializes against vmalloc with the
> > vmlist_lock, it sounds ok.
> 
> So what happens when the kernel accesses the non-existant pte's or when
> the vmalloc space runs out ?

Just for the record, let me answer myself:

When the delayed vmalloc pagetable copying activates during such a
pagefault, the individual PTE's are not copied, but just the pointer to
the PTE container page is inserted into the pgd (or pmd, for 3-level). 

So any pointers from the pgd in non-init processes are simply to the
corresponding pmd and pte container in the init_mm, thus vfree can
remove the PTE's, flush the tlb and bob's your uncle. Too bad there are
not any comments at all in the code to mention design issues like this.

Back to another theory on why my vmalloc pgtables screw up :)

/Bjorn



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: alloc_area_pte: page already exists
  2001-08-09 17:04         ` Andrea Arcangeli
  2001-08-09 21:50           ` Bjorn Wesen
@ 2001-08-12 12:16           ` Bjorn Wesen
  2001-08-12 21:43             ` Urban Widmark
  1 sibling, 1 reply; 10+ messages in thread
From: Bjorn Wesen @ 2001-08-12 12:16 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel

On Thu, 9 Aug 2001, Andrea Arcangeli wrote:
> > IOW I want the irq to "trigger" the freeing of the iovecs but it's ok if
> > it's done later, as long as it's done without any races :)
> 
> Your design looks suspect, but you can do that safely (at least as far
> as vfree is concerned) with keventd's schedule_task().

I looked into this and I can't figure out how to actually use this. It
seems as if schedule_task doesn't take a copy of the incoming tq_struct
but actually uses it as-is - and this poses a problem of creating the
tq_struct in the first place.

You can't kmalloc it, because then the task called by the tq_struct will
need to kfree itself and I'm pretty sure the kernel won't enjoy that.

You can't have it on the stack because the frame will exit long before
it's used.

You can't have it static because you might need to enqueue more than one
(with different "arguments" possibly)

In fact, looking at how it's used in the kernel today highlights the
problem: drivers/char/tty_io.c:

1866  * The tq handling here is a little racy - tty->SAK_tq may already be
queued.
1867  * But there's no mechanism to fix that without futzing with
tqueue_lock.
1868  * Fortunately we don't need to worry, because if ->SAK_tq is already
queued,
1869  * the values which we write to it will be identical to the values
which it
1870  * already has. --akpm
1871  */
1872 void do_SAK(struct tty_struct *tty)
1873 {
1874         PREPARE_TQUEUE(&tty->SAK_tq, __do_SAK, tty);
1875         schedule_task(&tty->SAK_tq);
1876 }

I'm obviously in a loosing situation here :(

/Bjorn


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: alloc_area_pte: page already exists
  2001-08-12 12:16           ` Bjorn Wesen
@ 2001-08-12 21:43             ` Urban Widmark
  0 siblings, 0 replies; 10+ messages in thread
From: Urban Widmark @ 2001-08-12 21:43 UTC (permalink / raw)
  To: Bjorn Wesen; +Cc: Andrea Arcangeli, linux-kernel

On Sun, 12 Aug 2001, Bjorn Wesen wrote:

> I looked into this and I can't figure out how to actually use this. It
> seems as if schedule_task doesn't take a copy of the incoming tq_struct
> but actually uses it as-is - and this poses a problem of creating the
> tq_struct in the first place.
> 
> You can't kmalloc it, because then the task called by the tq_struct will
> need to kfree itself and I'm pretty sure the kernel won't enjoy that.

I think you can kmalloc it as __run_task_queue (assuming that is the only
consumer of tasks ...) moves the queued tasks to a private list and then
never references an element after the 'routine' of that element has been
called.

smbfs does this in fs/smbfs/sock.c (and if that's wrong I'd like to know :)
and so does reiserfs in fs/reiserfs/journal.c.

/Urban


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2001-08-12 21:43 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-08-09 13:32 alloc_area_pte: page already exists Bjorn Wesen
2001-08-09 16:03 ` Andrea Arcangeli
2001-08-09 15:12   ` Bjorn Wesen
2001-08-09 16:36     ` Andrea Arcangeli
2001-08-09 15:33       ` Bjorn Wesen
2001-08-09 17:04         ` Andrea Arcangeli
2001-08-09 21:50           ` Bjorn Wesen
2001-08-12  9:44             ` Bjorn Wesen
2001-08-12 12:16           ` Bjorn Wesen
2001-08-12 21:43             ` Urban Widmark

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).