linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* 2.4.8aa1
@ 2001-08-11 12:58 Andrea Arcangeli
  2001-08-11 13:32 ` 2.4.8aa1 Eyal Lebedinsky
  0 siblings, 1 reply; 9+ messages in thread
From: Andrea Arcangeli @ 2001-08-11 12:58 UTC (permalink / raw)
  To: linux-kernel

Only in 2.4.8pre8aa1: 40_blkdev-pagecache-11
Only in 2.4.8aa1: 40_blkdev-pagecache-12

	Fixed a missing drop_super and some other detail related to
	superblock handling (superblock cannot go away anymore under
	us after a get_super, so some piece of code is been cleaned
	up).

Andrea

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: 2.4.8aa1
  2001-08-11 12:58 2.4.8aa1 Andrea Arcangeli
@ 2001-08-11 13:32 ` Eyal Lebedinsky
  2001-08-11 14:02   ` 2.4.8aa1 Andrea Arcangeli
  0 siblings, 1 reply; 9+ messages in thread
From: Eyal Lebedinsky @ 2001-08-11 13:32 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel

Andrea Arcangeli wrote:
> 
> Only in 2.4.8pre8aa1: 40_blkdev-pagecache-11
> Only in 2.4.8aa1: 40_blkdev-pagecache-12

Everything possible selected as a module.

The relevant DRM part of .config:

CONFIG_DRM=y
CONFIG_DRM_TDFX=m   
CONFIG_DRM_GAMMA=m
CONFIG_DRM_R128=m 
CONFIG_DRM_RADEON=m  
CONFIG_DRM_I810=m
CONFIG_DRM_MGA=m

gcc -D__KERNEL__ -I/data2/usr/local/src/linux-2.4/include -Wall
-Wstrict-prototypes -Wno-trigraphs -O2 -fomit-frame-pointer
-fno-strict-aliasing -fno-common -pipe -mpreferred-stack-boundary=2
-march=i686 -malign-functions=4  -DMODULE -DMODVERSIONS -include
/data2/usr/local/src/linux-2.4/include/linux/modversions.h   -c -o
r128_drv.o r128_drv.c
In file included from r128_drv.c:36:
ati_pcigart.h: In function `r128_ati_pcigart_init':
ati_pcigart.h:114: structure has no member named `virtual'
make[3]: *** [r128_drv.o] Error 1
make[3]: Leaving directory
`/data2/usr/local/src/linux-2.4/drivers/char/drm'

--
Eyal Lebedinsky (eyal@eyal.emu.id.au) <http://samba.anu.edu.au/eyal/>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: 2.4.8aa1
  2001-08-11 13:32 ` 2.4.8aa1 Eyal Lebedinsky
@ 2001-08-11 14:02   ` Andrea Arcangeli
  2001-08-11 18:20     ` 2.4.8aa1 Linus Torvalds
  0 siblings, 1 reply; 9+ messages in thread
From: Andrea Arcangeli @ 2001-08-11 14:02 UTC (permalink / raw)
  To: Eyal Lebedinsky; +Cc: linux-kernel, Linus Torvalds

On Sat, Aug 11, 2001 at 11:32:46PM +1000, Eyal Lebedinsky wrote:
> Andrea Arcangeli wrote:
> > 
> > Only in 2.4.8pre8aa1: 40_blkdev-pagecache-11
> > Only in 2.4.8aa1: 40_blkdev-pagecache-12
> 
> Everything possible selected as a module.
> 
> The relevant DRM part of .config:
> 
> CONFIG_DRM=y
> CONFIG_DRM_TDFX=m   
> CONFIG_DRM_GAMMA=m
> CONFIG_DRM_R128=m 
> CONFIG_DRM_RADEON=m  
> CONFIG_DRM_I810=m
> CONFIG_DRM_MGA=m
> 
> gcc -D__KERNEL__ -I/data2/usr/local/src/linux-2.4/include -Wall
> -Wstrict-prototypes -Wno-trigraphs -O2 -fomit-frame-pointer
> -fno-strict-aliasing -fno-common -pipe -mpreferred-stack-boundary=2
> -march=i686 -malign-functions=4  -DMODULE -DMODVERSIONS -include
> /data2/usr/local/src/linux-2.4/include/linux/modversions.h   -c -o
> r128_drv.o r128_drv.c
> In file included from r128_drv.c:36:
> ati_pcigart.h: In function `r128_ati_pcigart_init':
> ati_pcigart.h:114: structure has no member named `virtual'
> make[3]: *** [r128_drv.o] Error 1
> make[3]: Leaving directory
> `/data2/usr/local/src/linux-2.4/drivers/char/drm'

This is the same problem I mentioned yesterday to the list. Nobody
should ever use page->virtual directly, it's not there in -aa when
highmem is disabled to save memory and increase performance, if it would
be in C or python it would be a private field of the class to make it
explicit (nitpicking, in python __ just rename and it's techincally
still visible from the outside of the class).

page_address(page) must be used instead of page->virtual.

Anyways here an incremental patch that will fix your compile problem
(Linus please include):

--- 2.4.8aa1/drivers/char/drm/ati_pcigart.h.~1~	Sat Aug 11 15:54:25 2001
+++ 2.4.8aa1/drivers/char/drm/ati_pcigart.h	Sat Aug 11 15:54:55 2001
@@ -111,7 +111,7 @@
 	memset( pci_gart, 0, ATI_MAX_PCIGART_PAGES * sizeof(u32) );
 
 	for ( i = 0 ; i < pages ; i++ ) {
-		page_base = virt_to_bus( entry->pagelist[i]->virtual );
+		page_base = virt_to_bus( page_address(entry->pagelist[i]) );
 		for (j = 0; j < (PAGE_SIZE / ATI_PCIGART_PAGE_SIZE); j++) {
 			*pci_gart++ = cpu_to_le32( page_base );
 			page_base += ATI_PCIGART_PAGE_SIZE;
--- 2.4.8aa1/drivers/char/drm/r128_cce.c.~1~	Sat Aug 11 08:04:05 2001
+++ 2.4.8aa1/drivers/char/drm/r128_cce.c	Sat Aug 11 15:55:51 2001
@@ -351,10 +351,10 @@
 		page_ofs = tmp_ofs >> PAGE_SHIFT;
 
 		R128_WRITE( R128_PM4_BUFFER_DL_RPTR_ADDR,
-			    virt_to_bus(entry->pagelist[page_ofs]->virtual));
+			    virt_to_bus(page_address(entry->pagelist[page_ofs])));
 
 		DRM_DEBUG( "ring rptr: offset=0x%08lx handle=0x%08lx\n",
-			   virt_to_bus(entry->pagelist[page_ofs]->virtual),
+			   virt_to_bus(page_address(entry->pagelist[page_ofs])),
 			   entry->handle + tmp_ofs );
 	}
 
--- 2.4.8aa1/drivers/char/drm/radeon_cp.c.~1~	Sat Aug 11 08:04:05 2001
+++ 2.4.8aa1/drivers/char/drm/radeon_cp.c	Sat Aug 11 15:56:33 2001
@@ -624,10 +624,10 @@
 		page_ofs = tmp_ofs >> PAGE_SHIFT;
 
 		RADEON_WRITE( RADEON_CP_RB_RPTR_ADDR,
-			      virt_to_bus(entry->pagelist[page_ofs]->virtual));
+			      virt_to_bus(page_address(entry->pagelist[page_ofs])));
 
 		DRM_DEBUG( "ring rptr: offset=0x%08lx handle=0x%08lx\n",
-			   virt_to_bus(entry->pagelist[page_ofs]->virtual),
+			   virt_to_bus(page_address(entry->pagelist[page_ofs])),
 			   entry->handle + tmp_ofs );
 	}
 

Also please include this below one too that I just addressed previously
for my own compilations.  (Eyal, you don't need to apply the below one
of course)

--- 2.4.8pre7aa1/drivers/char/drm/drm_vm.h.~1~	Thu Aug  9 01:37:46 2001
+++ 2.4.8pre7aa1/drivers/char/drm/drm_vm.h	Thu Aug  9 01:42:22 2001
@@ -107,7 +107,7 @@
 	if( !pmd_present( *pmd ) ) return NOPAGE_OOM;
 	pte = pte_offset( pmd, i );
 	if( !pte_present( *pte ) ) return NOPAGE_OOM;
-	physical = (unsigned long)pte_page( *pte )->virtual;
+	physical = (unsigned long)page_address(pte_page( *pte ));
 	atomic_inc(&virt_to_page(physical)->count); /* Dec. by kernel */
 
 	DRM_DEBUG("0x%08lx => 0x%08lx\n", address, physical);

Andrea

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: 2.4.8aa1
  2001-08-11 14:02   ` 2.4.8aa1 Andrea Arcangeli
@ 2001-08-11 18:20     ` Linus Torvalds
  2001-08-12 17:02       ` 2.4.8aa1 Andrea Arcangeli
  0 siblings, 1 reply; 9+ messages in thread
From: Linus Torvalds @ 2001-08-11 18:20 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Eyal Lebedinsky, linux-kernel


Andrea,
 mind cleaning this up a bit and not just papering over the horridness?

On Sat, 11 Aug 2001, Andrea Arcangeli wrote:
>
> This is the same problem I mentioned yesterday to the list. Nobody
> should ever use page->virtual directly, it's not there in -aa when
> highmem is disabled to save memory and increase performance, if it would
> be in C or python it would be a private field of the class to make it
> explicit (nitpicking, in python __ just rename and it's techincally
> still visible from the outside of the class).
>
> page_address(page) must be used instead of page->virtual.

It would be good to instead adding the functions

	unsigned long pte_to_pfn(pte_t pte)
	{
		.. architecture-specific in asm/pgtable.h ..
	}

	/*
	 * struct page -> "page frame number", ie
	 * physical page number.
	 */
	unsigned long page_to_pfn(struct page *page)
	{
		zone_t zone = page->zone;
		return (page - zone->zone_mem_map) + zone->zone_start_mapnr;
	}

	unsigned long long page_to_bus(struct page *page)
	{
		return (unsigned long long) phys_to_bus(page_to_pfn(page) << PAGE_SHIFT;
	}

and using those? As it is right now, drivers that _could_ use up to 4GB of
bus addresses simply cannot do it, because there is no good way to get the
high physical addresses from a "struct page".

(You can do the above by hand, of course, but device driver writers really
shouldn't know about the internals of page zone handling).

The things that you changed were all

	virt_to_bus( page_address (...) )

which really is rather distateful in that it artificially limits itself to
only lowmem code, and gets randomly incorrect values for anything else.

> @@ -107,7 +107,7 @@
>  	if( !pmd_present( *pmd ) ) return NOPAGE_OOM;
>  	pte = pte_offset( pmd, i );
>  	if( !pte_present( *pte ) ) return NOPAGE_OOM;
> -	physical = (unsigned long)pte_page( *pte )->virtual;
> +	physical = (unsigned long)page_address(pte_page( *pte ));
>  	atomic_inc(&virt_to_page(physical)->count); /* Dec. by kernel */

That is just too ugly for words. It's not a "physical" address at all, but
a virtual one, and we're getting the virtual address from the struct page
just to get back to the struct page.

It should really do

	struct page *page = pte_page( *pte );
	get_page(page);

instead..

		Linus


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: 2.4.8aa1
  2001-08-11 18:20     ` 2.4.8aa1 Linus Torvalds
@ 2001-08-12 17:02       ` Andrea Arcangeli
  2001-08-12 17:21         ` 2.4.8aa1 Linus Torvalds
  2001-08-13  3:29         ` 2.4.8aa1 David S. Miller
  0 siblings, 2 replies; 9+ messages in thread
From: Andrea Arcangeli @ 2001-08-12 17:02 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Eyal Lebedinsky, linux-kernel

On Sat, Aug 11, 2001 at 11:20:07AM -0700, Linus Torvalds wrote:
> 
> Andrea,
>  mind cleaning this up a bit and not just papering over the horridness?

I'm sorry but while fixing the compilation troubles I didn't spent one
millisecond looking at what the code was trying to achieve despite it
looked doing something silly, I was just doing a blind
sed/page->virtual/page_address(page)/ and I cared to communicate the
community to never use page->virtual but to always use page_address()
instead. Infact you obviously didn't looked at such horrid code either
when you included it between 2.4.8pre6 and 2.4.8pre7 (not that I pretend
you to look at every driver update you include of course, this is not a
remark, it's just a matter of fact):

diff -urN 2.4.8pre6/drivers/char/drm/drm_vm.h
2.4.8pre7/drivers/char/drm/drm_vm.h
--- 2.4.8pre6/drivers/char/drm/drm_vm.h	Thu Jan  1 01:00:00 1970
+++ 2.4.8pre7/drivers/char/drm/drm_vm.h	Wed Aug  8 22:06:34 2001
[..]
+	physical = (unsigned long)pte_page( *pte )->virtual;
+	atomic_inc(&virt_to_page(physical)->count); /* Dec. by kernel */
[..]

BTW, I think it would be nice if we also avoid to waste ram and x86 ram
bandwith with page->virtual in mainline where it's not needed, here the
patch (possibly other archs could use it too by simply defining the
CONFIG_NO_PAGE_VIRTUAL to Y when highmem is not defined, but I thought
it was mostly important for x86 lowend):

diff -urN 2.4.6pre5/arch/i386/config.in novirtual/arch/i386/config.in
--- 2.4.6pre5/arch/i386/config.in	Thu Jun 21 08:03:30 2001
+++ novirtual/arch/i386/config.in	Thu Jun 21 16:02:11 2001
@@ -165,6 +165,9 @@
    define_bool CONFIG_HIGHMEM y
    define_bool CONFIG_X86_PAE y
 fi
+if [ "$CONFIG_NOHIGHMEM" = "y" ]; then
+   define_bool CONFIG_NO_PAGE_VIRTUAL y
+fi
 
 bool 'Math emulation' CONFIG_MATH_EMULATION
 bool 'MTRR (Memory Type Range Register) support' CONFIG_MTRR
diff -urN 2.4.6pre5/include/asm-i386/pgtable.h novirtual/include/asm-i386/pgtable.h
--- 2.4.6pre5/include/asm-i386/pgtable.h	Thu Jun 14 18:07:49 2001
+++ novirtual/include/asm-i386/pgtable.h	Thu Jun 21 16:02:11 2001
@@ -255,7 +255,11 @@
  * Permanent address of a page. Obviously must never be
  * called on a highmem page.
  */
+#ifdef CONFIG_NO_PAGE_VIRTUAL
+#define page_address(page) __va((page - mem_map) << PAGE_SHIFT)
+#else
 #define page_address(page) ((page)->virtual)
+#endif
 #define pages_to_mb(x) ((x) >> (20-PAGE_SHIFT))
 
 /*
diff -urN 2.4.6pre5/include/linux/mm.h novirtual/include/linux/mm.h
--- 2.4.6pre5/include/linux/mm.h	Thu Jun 21 08:03:56 2001
+++ novirtual/include/linux/mm.h	Thu Jun 21 16:02:33 2001
@@ -160,8 +160,10 @@
 	wait_queue_head_t wait;		/* Page locked?  Stand in line... */
 	struct page **pprev_hash;	/* Complement to *next_hash. */
 	struct buffer_head * buffers;	/* Buffer maps us to a disk block. */
+#ifndef CONFIG_NO_PAGE_VIRTUAL
 	void *virtual;			/* Kernel virtual address (NULL if
 					   not kmapped, ie. highmem) */
+#endif
 	struct zone_struct *zone;	/* Memory zone we are in. */
 } mem_map_t;
 
diff -urN 2.4.6pre5/mm/page_alloc.c novirtual/mm/page_alloc.c
--- 2.4.6pre5/mm/page_alloc.c	Thu Jun 21 08:03:57 2001
+++ novirtual/mm/page_alloc.c	Thu Jun 21 16:02:11 2001
@@ -851,8 +851,10 @@
 		for (i = 0; i < size; i++) {
 			struct page *page = mem_map + offset + i;
 			page->zone = zone;
+#ifndef CONFIG_NO_PAGE_VIRTUAL
 			if (j != ZONE_HIGHMEM)
 				page->virtual = __va(zone_start_paddr);
+#endif
 			zone_start_paddr += PAGE_SIZE;
 		}
 

> On Sat, 11 Aug 2001, Andrea Arcangeli wrote:
> >
> > This is the same problem I mentioned yesterday to the list. Nobody
> > should ever use page->virtual directly, it's not there in -aa when
> > highmem is disabled to save memory and increase performance, if it would
> > be in C or python it would be a private field of the class to make it
> > explicit (nitpicking, in python __ just rename and it's techincally
> > still visible from the outside of the class).
> >
> > page_address(page) must be used instead of page->virtual.
> 
> It would be good to instead adding the functions
> 
> 	unsigned long pte_to_pfn(pte_t pte)
> 	{
> 		.. architecture-specific in asm/pgtable.h ..
> 	}
> 
> 	/*
> 	 * struct page -> "page frame number", ie
> 	 * physical page number.
> 	 */
> 	unsigned long page_to_pfn(struct page *page)
> 	{
> 		zone_t zone = page->zone;
> 		return (page - zone->zone_mem_map) + zone->zone_start_mapnr;
> 	}
> 
> 	unsigned long long page_to_bus(struct page *page)
> 	{
> 		return (unsigned long long) phys_to_bus(page_to_pfn(page) << PAGE_SHIFT;
> 	}
> 
> and using those? As it is right now, drivers that _could_ use up to 4GB of
> bus addresses simply cannot do it, because there is no good way to get the
> high physical addresses from a "struct page".
> 
> (You can do the above by hand, of course, but device driver writers really
> shouldn't know about the internals of page zone handling).
> 
> The things that you changed were all
> 
> 	virt_to_bus( page_address (...) )
> 
> which really is rather distateful in that it artificially limits itself to
> only lowmem code, and gets randomly incorrect values for anything else.

virt_to_bus(page_address(pte_page(pte))) actually returns the right bus
address on x86 in a range of 4G, so the first part of the highmem (99%
of boxes where those drivers are needed I guess) gets covered too (of
course also virt_to_bus is limited to 32bit in its reval in first place
so anything physical above 4g breaks on the 32bit boxes). In 2.2 instead
virt_to_bus malfunctions on _all_ the highmem "virtual addresses" (if we
can call ""virtual addresses"" the wrap around).

But of course calling page_address() on anything highmem is ugly like
hell even if it ""incidentally"" works on 2.4 up to 4G (it temporarly
wraps around in userspace).

Furthmore the worst bug here is that calling virt_to_bus will break
badly on alpha or any other 64bit arch. virt_to_bus should never used
and it should been undefined by davem when he introduced in the iommu
API so people would stop doing those mistakes even today in a 2.4.8
driver update. The pci api on x86 will work up to 4G (as well as the
virt_to_bus(page_address(pte_page(pte))) and then it will break because
there's not iommu on x86, while on the real 64bit platforms it works for
all the physical ram, but nobody checks for the reval so the iommu
capable platforms are prone to break too.

Also the fact we pass virtual addresses to the pci_map API is quite
an ugly API for the 32bit platforms, same ugliness of
virt_to_bus(page_address(pte_page(pte))) on the memory between 1G and
4G, it was really only supposed to get the lowmem virtual addresses
infact (as virt_to_bus after all).

For your suggestion of the pte_to_pfn and the reverse I'm not sure why
they're needed (of course for the arch is extremely simple to implement
pte_to_pfn, on x86 is a shitright of 12, on alpha it's a shiftright of
32, much simpler than pte_page, pte_page has to deal with discontigmem
on numa system so it's slower). But in general people shouldn't use
pte_page in drivers, they shouldn't walk pagetables, the vm of linux
walks pagetables instead. We need pte_page for things like ptrace,
flushing of dirty pagetables etc... when we have virtual userspace
addresses and we must find the pages we are working with. But a device
driver can either get the page structure via the kiobuf if it's using
the direct I/O "pinning userspace memory way" or allocate the memory
itself inside the driver so it will have to deal again only with an arry
of page structures in first place, never with pte_page. So any usage of
pte_page in drivers looks a bit suspect on the design side.

Now about your page_to_bus suggestion I think it's on the right track,
virt_*anything* has to die, we cannot deal with kernel virtual addresses
on pci data any longer, people should either use pfn or page structures
(page structures is the most generic and preferred way I believe because
drivers should only deal with page structures as said above).

So I believe a kind of page_to_bus64 should be implemented, and it should
possibly return dma64_addr_t typedeffed as 'unsigned long long'.

But the real problem is how to have the driver understand that on some
arch it should keep using the iommu 32bit api instead of DAC because
those archs have crappy PCI64 DAC hardware implementation that inhibits
pci prefetch cycles or stuff like that that leads to bad performance
(DaveM can certainly provide more details). This is the hard part of the
API I believe. So we either need the driver to have two explicit cases
(then the plain page_to_bus64 would not need other changes), or to hide
the iommu stuff inside the page_to_bus64, but in such case it means we
also need to have a second function to release the pte ala pci_unmap_*.

BTW, at the moment what the quadrics and myri folks are doing on alpha
in 2.4 is just to use the monster window by hand and to use DAC where
avaialble, we enabled it by default on the tsunami chipsets for that
reason (for those drivers that maps huge amounts of ram it's manadatory
not to generate pressure on the iommu pagetables or they can destabilize
the system since all drivers are buggy and they don't check for iommu
map functions faliures). The myri folks told me if the monster window is
not available they now have an high bound of 32mbytes of data in flight
again to avoid those stability troubles and that looks fine.

And what the driver should ever do when its device is 32bit and it gets
a bus address out of page_to_bus64 that overflows 32bit and no iommu is
available? To handle that case it should either enforce that condition
not to happen by design (by allocating its pages without __GFP_HIGHMEM
if it's using the mmap callback, of course with map_user_kiobuf directIO
way it's not possible to enforce it by design instead) or to preallocate
some lowmem page to do the bounce buffers (note: it cannot allocate the
lowmem page on the fly if it is in a path that could be used to release
memory, like I/O or it could deadlock, the memory reservation should be
provided by the VM as Stephen suggested, Ben has some patches for that.
The advantage of the having the reservation provided by the VM rather
than having each subsystem preallocating a minimal amount of ram that
allows it to make progress regardless of GFP [like highmem.c currently
does] is that the VM has the knowledge to still allow those "reserved"
pages to be used as "clean unmapped cache", so they would not be wasted,
but it must not be "clean cache" as it is now, it should really be "not
locked" and "unmapped" atomically freeable cache, as soon as somebody
wants to do I/O or map the page in userspace, the page should be refiled
and replaced, and possibly the I/O should wait for some other memory to
be freed before the lock/map on the page can succeed, so it's *way* more
complicated than what I seen implemented so far by Ben if we want
something reliable).

> > @@ -107,7 +107,7 @@
> >  	if( !pmd_present( *pmd ) ) return NOPAGE_OOM;
> >  	pte = pte_offset( pmd, i );
> >  	if( !pte_present( *pte ) ) return NOPAGE_OOM;
> > -	physical = (unsigned long)pte_page( *pte )->virtual;
> > +	physical = (unsigned long)page_address(pte_page( *pte ));
> >  	atomic_inc(&virt_to_page(physical)->count); /* Dec. by kernel */
> 
> That is just too ugly for words. It's not a "physical" address at all, but
> a virtual one, and we're getting the virtual address from the struct page
> just to get back to the struct page.
> 
> It should really do
> 
> 	struct page *page = pte_page( *pte );
> 	get_page(page);
> 
> instead..

Indeed ;)

Andrea

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: 2.4.8aa1
  2001-08-12 17:02       ` 2.4.8aa1 Andrea Arcangeli
@ 2001-08-12 17:21         ` Linus Torvalds
  2001-08-12 18:49           ` 2.4.8aa1 Andrea Arcangeli
  2001-08-13  3:29         ` 2.4.8aa1 David S. Miller
  1 sibling, 1 reply; 9+ messages in thread
From: Linus Torvalds @ 2001-08-12 17:21 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Eyal Lebedinsky, linux-kernel


On Sun, 12 Aug 2001, Andrea Arcangeli wrote:
>
> virt_to_bus(page_address(pte_page(pte))) actually returns the right bus
> address on x86 in a range of 4G,

No it doesn't.

page_address(page) is (page)->virtual, which is going to be zero for
non-kmapped pages, and going to ve the virtual mapping of a kmapped page.
NEITHER of which will translate correctly with "virt_to_bus()".

In short, "virt_to_bus(page_address(pte_page(pte)))" only works for the
first 1GB. Always have, always will.

> But of course calling page_address() on anything highmem is ugly like
> hell even if it ""incidentally"" works on 2.4 up to 4G (it temporarly
> wraps around in userspace).

It incidentally DOES NOT WORK.

It happens to work for you in the non-highmem case when you disabled
"page->virtual", but in that case HIGHMEM obviously doesn't work at all,
so you cannot actually ever _use_ anything but the low 1GB anyway, so it
ends up not working in practice even then.

So stop making these things up.

> For your suggestion of the pte_to_pfn and the reverse I'm not sure why
> they're needed (of course for the arch is extremely simple to implement
> pte_to_pfn, on x86 is a shitright of 12, on alpha it's a shiftright of
> 32, much simpler than pte_page, pte_page has to deal with discontigmem
> on numa system so it's slower). But in general people shouldn't use
> pte_page in drivers, they shouldn't walk pagetables, the vm of linux
> walks pagetables instead.

I agree. However, if you looked at how I used it, it was just as a helper
function for doing the "struct page -> physical" and "struct page -> bus"
address calculations, which _are_ valid.

Think of the "page_to_pfn()" as an internal helper routine that avoids the
overflow thing.

But also, there is actually at least one useful case in the VM layer that
wouldn't mind having it - look at what the VALID_PAGE() users in
mm/memory.c, and look at the code we generate now (first we do
"pte_page()", then we take the page and turn it into a PFN, and then we
compare that PFN against the highest PFN we support - and all without
realizing that we already _had_ the PFN in the page table and could have
avoided one conversion altogether).

Also, look at remap_pte_range: it would actually be a _lot_ better off
using PFN's instead of the current physical addresses. Right now it cannot
handle 64-bit remapping on x86, and the reason is again that we don't have
good conversion functions..

> So I believe a kind of page_to_bus64 should be implemented, and it should
> possibly return dma64_addr_t typedeffed as 'unsigned long long'.

Fair enough, that sounds like a good idea too.

> But the real problem is how to have the driver understand that on some
> arch it should keep using the iommu 32bit api instead of DAC because
> those archs have crappy PCI64 DAC hardware implementation that inhibits
> pci prefetch cycles or stuff like that that leads to bad performance
> (DaveM can certainly provide more details). This is the hard part of the
> API I believe. So we either need the driver to have two explicit cases
> (then the plain page_to_bus64 would not need other changes), or to hide
> the iommu stuff inside the page_to_bus64, but in such case it means we
> also need to have a second function to release the pte ala pci_unmap_*.

Yes. And we should call them something like "pci_kmap()", "pci_kunmap()",
because that is exactly what they'd be.

> And what the driver should ever do when its device is 32bit and it gets
> a bus address out of page_to_bus64 that overflows 32bit and no iommu is
> available?

It shouldn't use page_to_bus64(), I believe. That function only makes
sense for 64-bit aware drivers.

32-bit drivers could still use the regular PCI mapping functions, which
will map it down to 32 bits as far as the driver is concerned, even if the
"real" unmapped bus address might be >32 bits. Agreed?

		Linus


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: 2.4.8aa1
  2001-08-12 17:21         ` 2.4.8aa1 Linus Torvalds
@ 2001-08-12 18:49           ` Andrea Arcangeli
  0 siblings, 0 replies; 9+ messages in thread
From: Andrea Arcangeli @ 2001-08-12 18:49 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Eyal Lebedinsky, linux-kernel

On Sun, Aug 12, 2001 at 10:21:58AM -0700, Linus Torvalds wrote:
> 
> On Sun, 12 Aug 2001, Andrea Arcangeli wrote:
> >
> > virt_to_bus(page_address(pte_page(pte))) actually returns the right bus
> > address on x86 in a range of 4G,
> 
> No it doesn't.
> 
> page_address(page) is (page)->virtual, which is going to be zero for
> non-kmapped pages, and going to ve the virtual mapping of a kmapped page.
> NEITHER of which will translate correctly with "virt_to_bus()".
> 
> In short, "virt_to_bus(page_address(pte_page(pte)))" only works for the
> first 1GB. Always have, always will.

Woops sorry, I was thinking at the 2.2 page_address, the 2.2 virt_to_bus
won't handle the warp around while the 2.4 does, in some bigmem patches
we used the 2.4 virt_to_bus just to allow
virt_to_bus(page_address(pte_page(pte))) to work up to 4G, this is why I
got mistaken. Of course in mainline 2.4 page_address(pte_page(pte)) is
just broken for any highmem page as you said.

Andrea

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: 2.4.8aa1
  2001-08-12 17:02       ` 2.4.8aa1 Andrea Arcangeli
  2001-08-12 17:21         ` 2.4.8aa1 Linus Torvalds
@ 2001-08-13  3:29         ` David S. Miller
  2001-08-13  5:44           ` 2.4.8aa1 Linus Torvalds
  1 sibling, 1 reply; 9+ messages in thread
From: David S. Miller @ 2001-08-13  3:29 UTC (permalink / raw)
  To: torvalds; +Cc: andrea, eyal, linux-kernel

   From: Linus Torvalds <torvalds@transmeta.com>
   Date: Sun, 12 Aug 2001 10:21:58 -0700 (PDT)
   
   On Sun, 12 Aug 2001, Andrea Arcangeli wrote:
   > So I believe a kind of page_to_bus64 should be implemented, and it should
   > possibly return dma64_addr_t typedeffed as 'unsigned long long'.
   
   Fair enough, that sounds like a good idea too.
   
I agree with the dma64_addr_t type, but I don't know if I agree
with the logistics of how drivers go about doing things and what
the interfaces are that an architecture implements.

First, maybe I'm confused about intent.  Are we trying to just hack
together something quick for 2.4.x that allows PCI drivers on highmem
machines to get at the complete low 4GB of physical memory, even the
highmem parts?

If this is the case, then I have no strong opinions about what you do
because I know I will be able to retrofit my ports into whatever
interfaces you come up with for that (f.e. the pci_kmap()/pci_kunmap()
stuff).

But if this is more far reaching, ie. a 64-bit addressing API for
drivers, my thinking is that this is a larger problem to solve.

For example, if you're talking seriously about sending page/off/len
triplets to the drivers (which is what we want in the end), then with
that I can whip together something really nice and portable in about
a weekend of thinking and hacking.

The only reason I haven't done this work yet is strictly because the
drivers aren't getting pages yet (well, the zerocopy aware networking
drivers are).  It was my understanding that this is the bit that Jens
Axboe and Andrea are working on.

Later,
David S. Miller
davem@redhat.com


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: 2.4.8aa1
  2001-08-13  3:29         ` 2.4.8aa1 David S. Miller
@ 2001-08-13  5:44           ` Linus Torvalds
  0 siblings, 0 replies; 9+ messages in thread
From: Linus Torvalds @ 2001-08-13  5:44 UTC (permalink / raw)
  To: David S. Miller; +Cc: andrea, eyal, linux-kernel


On Sun, 12 Aug 2001, David S. Miller wrote:
>
> First, maybe I'm confused about intent.  Are we trying to just hack
> together something quick for 2.4.x that allows PCI drivers on highmem
> machines to get at the complete low 4GB of physical memory, even the
> highmem parts?

No, it's more about cleaning up stuff for 2.4.x - the drivers affected
right now can't use more than 1GB of ram anyway, but as they were written
they would not compile in Andrea's tree due to having a bit too intimate a
knowledge of the VM system.

For 2.5.x, we'll see where we end up going in the long run.

		Linus


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2001-08-13  5:45 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-08-11 12:58 2.4.8aa1 Andrea Arcangeli
2001-08-11 13:32 ` 2.4.8aa1 Eyal Lebedinsky
2001-08-11 14:02   ` 2.4.8aa1 Andrea Arcangeli
2001-08-11 18:20     ` 2.4.8aa1 Linus Torvalds
2001-08-12 17:02       ` 2.4.8aa1 Andrea Arcangeli
2001-08-12 17:21         ` 2.4.8aa1 Linus Torvalds
2001-08-12 18:49           ` 2.4.8aa1 Andrea Arcangeli
2001-08-13  3:29         ` 2.4.8aa1 David S. Miller
2001-08-13  5:44           ` 2.4.8aa1 Linus Torvalds

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).