* [DISCUSSION] Hexagon code inside kernel
@ 2013-02-15 14:28 cotulla
       [not found] ` <CAHrUA364XES66kXhr0Gg1dh_MQBAS0+R8Q4x+EY3dgz6s=QRww@mail.gmail.com>
  0 siblings, 1 reply; 32+ messages in thread
From: cotulla @ 2013-02-15 14:28 UTC
  To: linux-hexagon

Hello,


Some time ago (in 3.4, I think) Hexagon architecture support was added to the Linux kernel.
Hexagon (also known as QDSP6) is a special DSP processor developed by Qualcomm Inc.
Qualcomm provided a set of patches to add Hexagon support to the Linux kernel.
QDSP6 is used in a lot of different SoCs (QSD8650B, MSM8960, APQ8064, etc.) which are very common in modern smartphones.


However, after looking deeply into those patches I found that this code is only supposed to run on top of some mythical "Hexagon Virtual Machine".
The "Hexagon Virtual Machine" is not available for free download, either as source code or in binary form.
I guess it's only available to Qualcomm customers who sign an NDA.
This effectively prevents running home-built kernels on a real QDSP6.

Personally, I am a homebrew hobby developer who likes to experiment with different interesting stuff inside modern smartphones.
And I want to be able to build and run the Linux kernel on the Hexagon processor.
I think including such code in the official open source kernel is nonsense: Qualcomm wants to use the open source community for its own interests and doesn't give anything back, hiding the actual hardware interfaces inside a Virtual Machine that is not freely available.

So I am curious why it was included in the official mainline kernel tree.
And how are kernel developers going to support and test that architecture without actual running hardware?

Can this be treated as a GPL violation?
This kernel code uses the "trap1" instruction, which actually switches execution to an exception handler located inside the VM.


Best regards,
-Cotulla


* Re: [DISCUSSION] Hexagon code inside kernel
       [not found] ` <CAHrUA364XES66kXhr0Gg1dh_MQBAS0+R8Q4x+EY3dgz6s=QRww@mail.gmail.com>
@ 2013-02-15 22:33   ` Linas Vepstas
  2013-02-16  1:35     ` cotulla
  0 siblings, 1 reply; 32+ messages in thread
From: Linas Vepstas @ 2013-02-15 22:33 UTC
  To: linux-hexagon; +Cc: cotulla

Hi,

On 15 February 2013 08:28, <cotulla@yandex.ru> wrote:
>
> However after deep looking inside those patches I found that
> this code is only supposed to be run on top of some mythic "Hexagon Virtual Machine".


It's documented at https://developer.qualcomm.com/hexagon-processor
https://developer.qualcomm.com/download/80-nb419-3ahexagonvirtualmachinespec.pdf

> "Hexagon Virtual Machine" is not available for free download in source
> code or either in binary form.


I think it was supposed to be, I'm not sure why it's not.  There had been some
grand plans for hexagon, but they never quite materialized.  It's far more
power-efficient than even the most efficient ARMs (I'm guessing it might be
the most power-efficient (32/64-bit) processor ever?), and that was going to
be its selling point.  I think they could not figure out how to sell
it, or to whom.

> I guess it's only available for Qualcomm customers by signing NDA.
> This prevents actually to run own home builded kernels on real QDSP6.

Well, I'm not sure, but I'm guessing that if you prowl around on the ROM,
you'll find a copy of it in there somewhere.  And if you don't, it's actually a
mini-VM, it's maybe a few hundred lines of assembly.  It really did very little
other than enabling and dispatching first-level interrupts. Perhaps I'm not
thinking clearly, but I'm guessing that you can hack around having it at all.
I don't remember if there's some must-use-it-or-die undocumented register
in there or not.

> I think including of such code inside official open source kernel is nosense -

Most of the other vendors do it too. Certainly, most of the powerpc code requires
an extremely large and complicated hypervisor -- something like millions of
lines of code -- and it's extremely proprietary.

> And how kernel developers are going to support and test that architecture
> without actual running hardware?

While I was at Qualcomm, I saw development boards that were supposed
to be generally available for a few hundred $$.  They were basically cell-phones
with the radio disabled (to avoid complications with the carriers) and some
extra headers soldered on for hardware debuggers.  Qualcomm was even
funding someone to build a very low-cost (sub $500) semi/mostly(?) open-source
hardware debugger, something that could read the JTAG bits and halt/single-step
the CPU.  The point of the low cost was to allow hobbyists to get into it, since
paying $5K to debug a $200 board only makes sense for NDA-signing corporations
who can afford it.

I left the company before this stuff was finalized, but they were very excited to
get this stuff going.  I don't know what happened.

--Linas


* Re: [DISCUSSION] Hexagon code inside kernel
  2013-02-15 22:33   ` Linas Vepstas
@ 2013-02-16  1:35     ` cotulla
  2013-02-16  2:34       ` Linas Vepstas
  0 siblings, 1 reply; 32+ messages in thread
From: cotulla @ 2013-02-16  1:35 UTC
  To: linux-hexagon

Hello,


> Its documented at https://developer.qualcomm.com/hexagon-processor
> https://developer.qualcomm.com/download/80-nb419-3ahexagonvirtualmachinespec.pdf

Yes, I have seen that document and am actively using it. But it's only a specification.
A lot of things are not really described there. And the kernel implementation is rather messy:
for example, in some places it uses the HVM interface for cache operations and in others it executes native instructions.


> I think it was supposed to be, I'm not sure why its not.  There had been some
> grand plans for hexagon, but they never quite materialized.  Its far more
> power-efficient than even the most efficient ARM's (I'm guessing it might be
> the most power-efficient (32/64-bit) processor ever?), and that was going to
> be its selling point.  I think they could not figure out how to sell
> it, or to whom.

Interesting. But is it rather slow? Especially memory access?


> Well, I'm not sure, but I'm guessing that if you prowl around on the ROM,
> you'll find a copy of it in there somewhere.

No, it isn't present in any ROM. I already checked that.
It would also be useless in there without an actual Linux image.


> And if you don't, its actually a mini-VM, its maybe a few hundred lines of assembly.  It really did very little
> other than enabling and dispatching first-level interrupts. Perhaps I'm not
> thinking clearly, but I'm guessing that you can hack around having it at all.
> I don't remember if there's some must-use-it-or-die undocumented register
> in there or not.

One important part is the software MMU implementation, because Hexagon doesn't have a hardware MMU, only standalone TLB entries.
But the HVM provides that functionality to the guest.
Also, most supervisor mode registers are not really documented, things like "bit 18 of SSR is interrupt mask flag".

Of course that's not a problem for me, I have already researched most of these things.
It should be enough to bring up a native Hexagon kernel instead of the fake hypervisor one.
I am working with the HTC LEO, which has a QSD8250B/QDSP6v2.
I can already run QDSP6 code in polling-bootloader mode.


> Most of the other vendors do it to. Certainly, most of the powerpc
> code requires
> an extremely large and complicated hypervisor -- something like millions of
> lines of code -- and its extremely proprietary.
Well, I guess they provide those hypervisors with the hardware, kind of like a BIOS?
In the Hexagon case it's actually dead code.
What is the point of including such a thing in the mainline kernel?


> While I was at Qualcomm, I saw development boards that were supposed
> to be generally available for a few hundred $$.  They were basically
> cell-phones
> with the radio disabled (to avoid complications with the carriers) and some
> extra headers soldered on for hardware debuggers.  Qualcomm was even
> funding someone to build a very low-cost (sub $500) semi/mostly(?) open-source
> hardware debugger,  something that could read the JTAG bits and
> halt/single-step
> the CPU.   The point of the low-cost was to allow hobbyists to get
> into it ..  since
> paying $5K to debug a $200 board only makes sense for NDA-signing corporations
> who can afford it.

Well, that sounds good. But what is the reason to push code to _mainline_ before all that is done?
Is it really good to include random bullshit in the kernel and clutter it?
Also, on a neighboring mailing list people say that the QCT camera code cannot be included in mainline because it has a private service for userspace.


-Cotulla


* Re: [DISCUSSION] Hexagon code inside kernel
  2013-02-16  1:35     ` cotulla
@ 2013-02-16  2:34       ` Linas Vepstas
  2013-02-16 12:39         ` cotulla
  0 siblings, 1 reply; 32+ messages in thread
From: Linas Vepstas @ 2013-02-16  2:34 UTC
  To: cotulla; +Cc: linux-hexagon

On 15 February 2013 19:35,  <cotulla@yandex.ua> wrote:

>>  Its far more
>>  power-efficient than even the most efficient ARM's (I'm guessing it might be
>>  the most power-efficient (32/64-bit) processor ever?), and that was going to
>>  be its selling point.  I think they could not figure out how to sell
>>  it, or to whom.
>
> Interesting. But is it rather slow?

For the qdsp6v3 the effective clock rate was 300MHz per core, so yes.
It might be even slower for v2, not sure.  (The chip clock rate is 1.8
GHz, there are 6 interleaved cores, so 1.8 GHz / 6 = 300 MHz.  The power savings
are not from the clock rate, but from the tiny transistor count. The
performance efficiency is from keeping all of those transistors
constantly wiggling, which is what the interleaved pipeline does.)
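
For anyone who wants to play with the numbers, here is a minimal sketch of that division. It assumes the hardware threads share one pipeline in strict rotation, so each gets an equal slice of the core clock. The function name is made up for illustration, and the 1.8 GHz / 6 figures are the ones quoted above, not checked against any datasheet.

```python
def effective_thread_clock_mhz(core_clock_mhz: float, hw_threads: int) -> float:
    """Per-thread clock when hw_threads contexts share one pipeline in rotation.

    Hypothetical helper: it only models an equal split of the core clock.
    """
    if hw_threads <= 0:
        raise ValueError("hw_threads must be positive")
    return core_clock_mhz / hw_threads


# Figures quoted in this mail: 1.8 GHz chip clock, 6 interleaved cores.
print(effective_thread_clock_mhz(1800, 6))  # 300.0
```

A real part may not split cycles this evenly (for instance when some threads are idle), so treat the result as a lower bound on what one thread sees.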

> Especially a memory access?

Noo .. in fact, that slow clock rate meant it's more or less the same
speed as DRAM.  So low penalty for cache miss.  But there's also a bus
between here and there, and I think it was selected for simplicity not
speed...

> No, it doesn't present in any ROM. I already checked that.
> As well as it is useless inside without actual Linux image.

Oh, well, no, the VM wasn't created for linux, it was created for all
the other junk they try to run on there (eg audio processing, noise
cancellation, etc.), so that there would be a forward migration path
for that software with each new chip generation.  Whether those other
software parts actually used the thing, and in what time-frame or
release .. hey .. typical big-company inter-divisional politics,
bickering, etc.  The truly daring wanted all those other parts to run
on the linux kernel.  The old-school was like "hell no we don't need
no stinkin OS, we'll hand code everything in assembly" which sounds
great until you have ten support staff helping them kludge up a really
really badly designed home-grown linker/loader/scheduler/irq-handler.
Sometimes, really smart people just don't get it.

> One important part is a software MMU implementation, because Hexagon doesn't have hardware MMU, only standalone TLB entries.
> But HVM provides that functionality to the guest.

Don't know v2. But v3 had a 'real' MMU

> As well as most supervisor mode registers are not really defined, like "bit 18 of SSR is interrupt mask flag".

:-)

> I already can run QDSP6 code in polling-bootloader mode.

Good, because the bootloader was going to be the other issue.

> Well, it sounds well. But what reason to push code to _mainline_ before all that is done?
> Is it really good to include random bullshit into kernel to make it cluttered?

I'd done the patches for glibc (yes, they're publicly available on
some website, don't know if they got merged or not), got 98% of the
many hundreds of glibc unit tests to pass, including most or all of
the thread tests including TLS. Someone had bootstrapped hundreds of
.debs and both python and perl passed 100% of their tests.  I'm sure
no one cares, but even guile worked, and I was about to start fiddling
with haskell :-)

--linas


* Re: [DISCUSSION] Hexagon code inside kernel
  2013-02-16  2:34       ` Linas Vepstas
@ 2013-02-16 12:39         ` cotulla
  2013-02-16 17:33           ` Linas Vepstas
                             ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: cotulla @ 2013-02-16 12:39 UTC
  To: linux-hexagon

Hi,

> For the qdsp6v3 the effective clock rate was 300MHz per core, so yes.
> It might be even slower for v2, not sure.  (the chip clock rate is 1.8
> GHz, there are 6 interleaved cores, so 1.8/6 = 300  The power savings
> are not from the clock rate, but from the tiny transistor count. The
> performance efficiency is from keeping all of those transistors
> constantly wiggling, which is what the interleaved pipeline does.)

Hm, I thought the maximum clock rate was 595.2 MHz?
Or is 1.8 GHz another clock?
But by changing this clock rate I can get different Q6 performance.

> Noo .. in fact, that slow clock rate meant it's more or less the same
> speed as DRAM.  So low penalty for cache miss.  But there's also a bus
> between here and there, and I think it was selected for simplicity not
> speed...
I thought it could be a bus arbiter setting, but the other masters like MODEM and APPS were almost idle during the test.
I think in that case the Q6 can take the whole bandwidth, or is it still limited somehow?


> Oh, well, no, the VM wasn't created for linux, it was created for all
> the other junk they try to run on there (eg audio processing, noise
> cancellation, etc.), so that there would be a forward migration path
> for that software with each new chip generation.  Whether those other
> software parts actually used the thing, and in what time-frame or
> release .. hey .. typical big-company inter-divisional politics,
> bickering, etc.  The truly daring wanted all those other parts to run
> on the linux kernel.  The old-school was like "hell no we don't need
> no stinkin OS, we'll hand code everything in assembly" which sounds
> great until you have ten support staff helping them kludge up a really
> really badly designed home-grown linker/loader/scheduler/irq-handler.
> Sometimes, really smart people just don't get it.

Well, sometimes such code becomes too large and slow, while not providing a universal solution for all cases.
Also, more code == more bugs, and logic that is harder to follow.

> Don't know v2. But v3 had a 'real' MMU
Hm, are you sure about that?
I have never seen any usage of it. Also, the binutils register definitions don't include any suitable registers for that.

> Good, because the bootloader was going to be the other issue.
Yes, in my case it's working :)
But other guys who also want to participate in this project with MSM8960/APQ8064 still can't run any unsigned code on the Q6.
In modern phones it's often locked against changes :(

> I'd done the patches for glibc (yes, they're publicly available on
> some website, don't know if they got merged or not), got 98% of the
> many hundreds of glibc unit tests to pass, including most or all of
> the thread tests including TLS. Someone had bootstrapped hundreds of
> .debs and both python and perl passed 100% of their tests.  I'm sure
> no one cares, but even guile worked, and I was about to start fiddling
> with haskell :-)
Good to hear that. Good job!
So userspace support is rather good in general.
Maybe we will try to compile and run Android on those 6 cores, pff, hardware threads :D


-Cotulla


* Re: [DISCUSSION] Hexagon code inside kernel
  2013-02-16 12:39         ` cotulla
@ 2013-02-16 17:33           ` Linas Vepstas
  2013-02-16 19:21             ` cotulla
  2013-02-19  4:36           ` rkuo
  2013-02-23  4:24           ` Rob Landley
  2 siblings, 1 reply; 32+ messages in thread
From: Linas Vepstas @ 2013-02-16 17:33 UTC
  To: cotulla; +Cc: linux-hexagon

On 16 February 2013 06:39,  <cotulla@yandex.ua> wrote:
> In modern phones it's often locked from changes :(

!  I did not know that.  BTW, be aware that there are usually two
different qdsp's on there.  The smaller/older/slower one is running
the radio signal processing code and it is very locked down, as that
code is considered to be extremely valuable.  (It's also the one that
has total control over all the master interrupt masks, which not even
the ARM can over-ride).  I'm not sure, the radio processor might not
even be a qdsp6, it might be a qdsp5 or qdsp4, I never paid attention
to that.  But the point is:  if it seems like it's locked down, make
sure you've got the right one, that there isn't another one sitting
there on the JTAG chain.

-- Linas


* Re: [DISCUSSION] Hexagon code inside kernel
  2013-02-16 17:33           ` Linas Vepstas
@ 2013-02-16 19:21             ` cotulla
  0 siblings, 0 replies; 32+ messages in thread
From: cotulla @ 2013-02-16 19:21 UTC
  To: linux-hexagon


> !  I did not know that.  BTW, be aware that there are usually two
> different qdsp's on there.  The smaller/older/slower one is running
> the radio signal processing code and it is very locked down, as that
> code is considered to be extremely valuable.  (Its also the one that
> has total control over all the master interrupt masks, which not even
> the ARM can over-ride).  I'm not sure, the radio processor might not
> even be a qdsp6, it might be a qdsp5 or qdsp4, I never paid attention
> to that.  But the point is:  if it seems like its locked down, make
> sure you've got the right one, that there isn't another one sitting
> there on the JTAG chain.
>

It depends on the QCT chipset.
In old ones, ARM + QDSP4 are used for modem operations,
and QDSP5 or QDSP6 for multimedia tasks.

New ones contain 2 x QDSP6 for modem usage and 1 x QDSP6 for audio.
So in some chipsets there are actually up to 3 QDSP6 instances nowadays.

But the problem is that it's heavily locked down nowadays (signature checking, hardware key + hardware crypto engine).
So it's a hard task to load your own code onto the QDSP6 :-(


So you worked primarily on userland QDSP6 support?


-Cotulla


* Re: [DISCUSSION] Hexagon code inside kernel
  2013-02-16 12:39         ` cotulla
  2013-02-16 17:33           ` Linas Vepstas
@ 2013-02-19  4:36           ` rkuo
  2013-02-19 14:29             ` Linas Vepstas
  2013-02-20  1:17             ` cotulla
  2013-02-23  4:24           ` Rob Landley
  2 siblings, 2 replies; 32+ messages in thread
From: rkuo @ 2013-02-19  4:36 UTC
  To: cotulla; +Cc: linux-hexagon

On Sat, Feb 16, 2013 at 04:39:49PM +0400, cotulla@yandex.ua wrote:
> >  Don't know v2. But v3 had a 'real' MMU
> Hm, are you sure in that? 
> I had never seen any usage of it. As well as binutils registers definition doesn't include any suitable registers for that. 

Both have an MMU, but both are software managed.  That's one of the
functions that the hypervisor provides for us.

 
> >  Good, because the bootloader was going to be the other issue.
> Yes, in my case it's working :)
> But another guys who also want participate in this project with MSM8960/APQ8064 they still can't run any unsigned code on Q6.
> In modern phones it's often locked from changes :(

Congrats on getting the bootloader working.

Linas already answered some of your questions (thanks Linas!)

Hexagon Linux so far has been just used on internal projects; it's
not a part of any products even though the processor itself is.  A
non-hypervisor port is certainly possible, but for our purposes
around here, it's always running under the hypervisor.


Thanks,
Richard Kuo


-- 

Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
hosted by The Linux Foundation


* Re: [DISCUSSION] Hexagon code inside kernel
  2013-02-19  4:36           ` rkuo
@ 2013-02-19 14:29             ` Linas Vepstas
  2013-02-20  1:07               ` cotulla
  2013-02-20  1:17             ` cotulla
  1 sibling, 1 reply; 32+ messages in thread
From: Linas Vepstas @ 2013-02-19 14:29 UTC
  To: rkuo; +Cc: cotulla, linux-hexagon

On 18 February 2013 22:36, rkuo <rkuo@codeaurora.org> wrote:
> On Sat, Feb 16, 2013 at 04:39:49PM +0400, cotulla@yandex.ua wrote:
>> >  Don't know v2. But v3 had a 'real' MMU
>> Hm, are you sure in that?
>> I had never seen any usage of it. As well as binutils registers definition doesn't include any suitable registers for that.
>
> Both have an MMU, but both are software managed.  That's one of the
> functions that the hypervisor provides for us.

I might be confused, but I believe that this is actually very common;
the difference is that the so-called 'hardware MMUs' are actually
just nanocode burned into a ROM that's on the cpu chip die.  It's
invisible to the user. FWIW some deprecated instructions are also
emulated this way: they fault to a handler. You can hide a lot of
stuff this way. The powerpc does this, they learned it from the
mainframe 370/390 architecture, which abuses this (did you really
think the 'startio' instruction, which behaves like a DMA device
driver, was actually a single cycle hardware instruction?)

The hexagon doesn't have a (self-loading) ROM, so the mini-VM has to
be loaded at boot.

-- Linas


* Re: [DISCUSSION] Hexagon code inside kernel
  2013-02-19 14:29             ` Linas Vepstas
@ 2013-02-20  1:07               ` cotulla
  0 siblings, 0 replies; 32+ messages in thread
From: cotulla @ 2013-02-20  1:07 UTC
  To: linux-hexagon


> I might be confused, but I believe that this is actually very common;
> the difference is that the so-called 'hardware MMU''s are actually
> just nanocode burned into a ROM that's on the cpu chip die.  Its
> invisible to the user. FWIW some deprecated instructions are also
> emulated this way: they fault to a handler. You can hide a lot of
> stuff this way. The powerpc does this, they learned it from the
> mainframe 370/390 architecture, which abuses this (did you really
> think the 'startio' instruction, which behaves like a DMA device
> driver, was actually a single cycle hardware instruction?)
>
> The hexagon doesn't have a (self-loading) ROM, so the mini-VM has to
> be loaded at boot.
>

I mean that Hexagon has something similar to MIPS.
Unlike ARM, the hardware does not fetch page table entries from RAM; it uses standalone TLB entries.
On ARM they are also present, but usually even low-level code doesn't work with them directly; it works with page tables in RAM.
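
To make the distinction concrete, here is a toy model of a software-managed TLB of the kind described above. This is only a sketch under assumptions: the 4 KiB page size, the FIFO eviction and all names are invented for illustration and are not taken from the Hexagon or HVM documentation.

```python
from collections import OrderedDict

PAGE_SHIFT = 12  # assumed 4 KiB pages, for illustration only


class SoftwareTLB:
    """Toy TLB with a fixed number of entries, refilled by software on a miss."""

    def __init__(self, entries, page_table):
        self.entries = entries
        self.page_table = page_table  # virtual page number -> physical page number
        self.tlb = OrderedDict()      # the standalone TLB entries
        self.misses = 0

    def translate(self, vaddr):
        vpn = vaddr >> PAGE_SHIFT
        offset = vaddr & ((1 << PAGE_SHIFT) - 1)
        if vpn not in self.tlb:
            self._miss_handler(vpn)   # on real hardware this would be a TLB-miss exception
        return (self.tlb[vpn] << PAGE_SHIFT) | offset

    def _miss_handler(self, vpn):
        # Software (kernel or hypervisor) looks up its own tables; the hardware does not walk them.
        self.misses += 1
        if vpn not in self.page_table:
            raise LookupError(f"page fault at VPN {vpn:#x}")
        if len(self.tlb) >= self.entries:
            self.tlb.popitem(last=False)  # evict the oldest entry (FIFO, an arbitrary choice)
        self.tlb[vpn] = self.page_table[vpn]


tlb = SoftwareTLB(entries=2, page_table={0x10: 0x80, 0x11: 0x81, 0x12: 0x82})
print(hex(tlb.translate(0x10123)))  # 0x80123
```

With only two entries and three pages in use, every new page evicts an older one, which is the thrashing you get when the TLB is too small for the working set.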


-Cotulla


* Re: [DISCUSSION] Hexagon code inside kernel
  2013-02-19  4:36           ` rkuo
  2013-02-19 14:29             ` Linas Vepstas
@ 2013-02-20  1:17             ` cotulla
  1 sibling, 0 replies; 32+ messages in thread
From: cotulla @ 2013-02-20  1:17 UTC
  To: linux-hexagon


> Both have an MMU, but both are software managed.  That's one of the
> functions that the hypervisor provides for us.

Yes, like MIPS one :)

> Congrats on getting the bootloader working.
>

We have already brought the kernel up to timers init :)
It was a rather hard task to add support for PHYS_OFFSET; the HVM kernel assumed that physical memory always starts at 0x00000000.

But some things are not really clear:
1. How does the hardware decide which SMP core takes an interrupt if all iassign bits are set (0x3F)?
Does every thread start executing the exception handler, or only one, selected by hardware?
How is preemption disabled in that case?

2. What is the difference between ciad and cswi?
I guess ciad is "Clear Interrupt Ask Done" and cswi is "Clear SoftWare Interrupt"?

3. How can preemption be done during exception handler execution?
Does the exception bit in SSR disallow entering a new exception handler?
I assume it's possible to clear the exception bit and continue execution in normal mode.

If possible, could you explain or help a bit with those things?


> Hexagon Linux so far has been just used on internal projects; it's
> not a part of any products even though the processor itself is.  A
> non-hypervisor port is certainly possible, but for our purposes
> around here, it's always running under the hypervisor.
>
But some comments in the code say that it was native at the start ;)


-Cotulla


* Re: [DISCUSSION] Hexagon code inside kernel
  2013-02-16 12:39         ` cotulla
  2013-02-16 17:33           ` Linas Vepstas
  2013-02-19  4:36           ` rkuo
@ 2013-02-23  4:24           ` Rob Landley
  2013-02-24 12:00             ` cotulla
  2013-02-24 12:23             ` cotulla
  2 siblings, 2 replies; 32+ messages in thread
From: Rob Landley @ 2013-02-23  4:24 UTC
  To: cotulla; +Cc: linux-hexagon

On 02/16/2013 06:39:49 AM, cotulla@yandex.ua wrote:
> Hi,
> 
> >  For the qdsp6v3 the effective clock rate was 300MHz per core, so yes.
> >  It might be even slower for v2, not sure.  (the chip clock rate is 1.8
> >  GHz, there are 6 interleaved cores, so 1.8/6 = 300  The power savings
> >  are not from the clock rate, but from the tiny transistor count. The
> >  performance efficiency is from keeping all of those transistors
> >  constantly wiggling, which is what the interleaved pipeline does.)
> 
> Hm, I thought the maximum clock rate is 595.2 Mhz?
> Or 1.8 is another clock?
> But by changing this clock rate I can get different Q6 performance.

The clever thing hexagon did was avoid any pipeline interlocks. Instead  
they had as many register profiles as pipeline stages, and they  
round-robined them down the pipeline. So the v2 processor ran at 600
MHz but presented to Linux as a 6-way SMP chip each running at 100 MHz.

This meant there were 6 clock cycles between each memory access, so the  
DRAM had no trouble keeping up. There was no speculative execution, no  
branch prediction, it never did wasted work and any pipeline stage that  
had nothing to do powered down completely for that clock cycle. They  
got performance out of it via massive parallelism: each instruction was  
a 4-issue VLIW, and the latter two cores were 4-way SIMD vector  
thingies, so if you could break your task into 6 chunks (4 graphics  
processes, an audio process, and a control process) it could do some  
quite heavy lifting.
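
As a rough illustration of that round-robin, here is a small sketch of a barrel-style issue schedule. It assumes strict rotation with one hardware thread issuing per core clock cycle; the function names are hypothetical and the model ignores stalls, idle threads and everything else real silicon does.

```python
def barrel_schedule(num_threads, num_cycles):
    """Return which hardware thread issues on each core clock cycle (strict rotation)."""
    return [cycle % num_threads for cycle in range(num_cycles)]


def issue_counts(num_threads, num_cycles):
    """Count how many issue slots each hardware thread gets over num_cycles."""
    counts = [0] * num_threads
    for thread in barrel_schedule(num_threads, num_cycles):
        counts[thread] += 1
    return counts


print(barrel_schedule(6, 12))  # [0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]
print(issue_counts(6, 600))    # [100, 100, 100, 100, 100, 100]
```

Read the second line as "600 core cycles shared six ways gives each thread 100", which is the same ratio as 600 MHz presenting as six 100 MHz CPUs.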

In the later chips, they were looking to reduce the number of pipeline
stages, which would let them clock the chip down (increasing the power
efficiency, since power consumption rises faster than linearly with clock speed)
while still allowing each thread to progress at 100 MHz. So a 300 MHz
chip is probably a 3 stage pipeline presenting as 3 way SMP.

I only did a 6 month contract there in 2010 beating bugs out of the  
toolchain. I know they hired Linutronix to help clean up their code so  
it had a chance of being accepted upstream, but tglx and crowd had to  
sign an NDA so I dunno what they're allowed to say about it, even now  
that some of the code's gone upstream.

> >  Don't know v2. But v3 had a 'real' MMU
> Hm, are you sure in that?
> I had never seen any usage of it. As well as binutils registers definition
> doesn't include any suitable registers for that.

The version I saw (v2) had a software loaded TLB which a binary blob  
made act like an MMU. It had too few TLB slots and kept thrashing them  
when running a real OS, so they were going to add more in a future  
version.

The thing to realize about Qualcomm is that the lawyers are in charge.
The patent licensing revenue is credited to the legal department but
the R&D costs of coming up with that IP in the first place are deducted
from engineering, so in terms of _net_ revenue it looks like licensing
is more profitable than engineering even though it's just a fancy  
story. Political power within the company is based on how much net  
revenue you're bringing in, and with Legal mooching off engineering  
like that they get to overrule them most of the time.

So they've got brilliant engineers who do brilliant things you never
hear about, and would LIKE to get them out into the real world but can
never get permission. (Hence craziness like the "Code Aurora Forum"
which is a partnership between Qualcomm and Qualcomm with some random
co-signer (Intel) there to make it SEEM like somebody else is involved,
because spinning off a wholly-owned subsidiary "Qualcomm Innovation
Center" and having that sock puppet do all your open source stuff isn't
considered enough of a firewall between Legal's precious patents and
the GPL.)

(Now add a bit of political infighting between the people who do their  
"Scorpion" licensed ARM core and the people who would like to see  
Hexagon used as a real processor instead of a multimedia coprocessor,  
and what little power engineering has is wasted.)

So it's really cool technology, fairly widely deployed, and if you want
to make use of it I'd recommend reverse engineering it. (You can look  
around the code aurora forum pages and download the toolchains they  
give to the android guys; those binary blobs get built with modified  
gcc+binutils and the lawyers scrupulously obey the letter of the law as  
they understand it; the code is published at an obscure URL somewhere.)

The fun part is that "objdump" can decode the magic instructions, even  
in the binary blob. Because it has to be able to compile them, you see.  
(They're working on Hexagon support for Open64 and LLVM, but gcc's  
still a more mature compiler. Google for "hexagon open64" and similar  
finds interesting stuff, by the way.)

> >  Good, because the bootloader was going to be the other issue.
> Yes, in my case it's working :)
> But another guys who also want participate in this project with  
> MSM8960/APQ8064 they still can't run any unsigned code on Q6.
> In modern phones it's often locked from changes :(

Getting hexagon support into QEMU would make life SO much easier...

> >  I'd done the patches for glibc (yes, they're publicly available on
> >  some website, don't know if they got merged or not), got 98% of the
> >  many hundreds of glibc unit tests to pass, including most or all of
> >  the thread tests including TLS. Someone had bootstrapped hundreds of
> >  .debs and both python and perl passed 100% of their tests.  I'm sure
> >  no one cares, but even guile worked, and I was about to start fiddling
> >  with haskell :-)
> Good to hear that. Good job!
> So userspace support is rather good in common.

I built Linux From Scratch and large chunks of Beyond Linux From
Scratch during my contract in 2010 (put together a demo with X11,
albeit just clients connecting to an X server running on another machine
through the net), but that was with their gcc 3.4, binutils 2.14, and
uClibc 0.9.30 forks. (All of which were obsolete already when I was  
there, and have probably been abandoned since.)

That was using... comet boards, I think? (Those hacked up phone  
motherboards Linas was talking about. The "snapdragon" SoC, QDSP6v2  
chips plus a Scorpion plus an armv5 plus a QDSP4, all in a big ball  
with USB and a serial port and an ethernet device and 256 megs of  
memory and I forget what else. We had a small number of them because  
they never made that many. Not a mass produced product, semi-obsolete  
at the time, but the linux porting effort scrounged what resources it  
could...)

Rob

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [DISCUSSION] Hexagon code inside kernel
  2013-02-23  4:24           ` Rob Landley
@ 2013-02-24 12:00             ` cotulla
  2013-02-24 16:32               ` Linas Vepstas
  2013-02-24 12:23             ` cotulla
  1 sibling, 1 reply; 32+ messages in thread
From: cotulla @ 2013-02-24 12:00 UTC (permalink / raw)
  To: linux-hexagon

> The clever thing hexagon did was avoid any pipeline interlocks. Instead
> they had as many register profiles as pipeline stages, and they
> round-robined them down the pipeline. So the v2 processor ran at 600
> mhz but presented to Linux as a 6-way SMP chip each running at 100 mhz.
>
> This meant there were 6 clock cycles between each memory access, so the
> DRAM had no trouble keeping up. There was no speculative execution, no
> branch prediction, it never did wasted work and any pipeline stage that
> had nothing to do powered down completely for that clock cycle. They
> got performance out of it via massive parallelism: each instruction was
> a 4-issue VLIW, and the latter two cores were 4-way SIMD vector
> thingies, so if you could break your task into 6 chunks (4 graphics
> processes, an audio process, and a control process) it could do some
> quite heavy lifting.
>
Good to know.
During tests I got around 90 Mbytes/sec of RAM throughput from a simple memcpy at 256 MHz, while the ARM got around 880 Mbytes/sec.
Do you think that's normal, or is something maybe configured wrong?
Memory is L1WB L2C. I & D prefetching is enabled.
memcpy is done the simplest way:

memcpy:
{ 
    loop0 (memcpy+0x10, r2)
    p0 = cmp.eq (r2, #0)
    r3 = r0 
}
if (p0) jumpr r31
r2 = memb (r1 ++ #1)
{ 
    nop
    memb (r3 ++ #1) = r2 
}:endloop0
jumpr r31


> The version I saw (v2) had a software loaded TLB which a binary blob
> made act like an MMU. It had too few TLB slots and kept thrashing them
> when running a real OS, so they were going to add more in a future
> version.
>
Yes, like MIPS. I wonder how many TLB entries ARM processors have?


> So it's really cool technology, fairly widely deployed, and if you want
> to make use of it I'd recommend reverse engineering it. (You can look
> around the code aurora forum pages and download the toolchains they
> give to the android guys; those binary blobs get built with modified
> gcc+binutils and the lawyers scrupulously obey the letter of the law as
> they understand it; the code is published at an obscure URL somewhere.)
>
> The fun part is that "objdump" can decode the magic instructions, even
> in the binary blob. Because it has to be able to compile them, you see.
> (They're working on Hexagon support for Open64 and LLVM, but gcc's
> still a more mature compiler. Google for "hexagon open64" and similar
> finds interesting stuff, by the way.)
Yes.
Actually, we got the kernel to boot all the way to the userland init process!
Of course some things are still not done, like SMP support, and some are not optimized (for example, it always flushes the full TLB instead of only the affected entries).
But it already looks rather good.

One of the problems: it seems CONFIG_PREEMPT is not supported with the HVM code.
Maybe it's a good idea to look at the ARM or MIPS implementation and reimplement it from scratch.


> That was using... comet boards, I think? (Those hacked up phone
> motherboards Linas was talking about. The "snapdragon" SoC, QDSP6v2
> chips plus a Scorpion plus an armv5 plus a QDSP4, all in a big ball
> with USB and a serial port and an ethernet device and 256 megs of
> memory and I forget what else. We had a small number of them because
> they never made that many. Not a mass produced product, semi-obsolete
> at the time, but the linux porting effort scrounged what resources it
> could...)
Yeah. QSD8x50B. My platform (HTC HD2) has exactly the same SoC :-)


-Cotulla

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [DISCUSSION] Hexagon code inside kernel
  2013-02-23  4:24           ` Rob Landley
  2013-02-24 12:00             ` cotulla
@ 2013-02-24 12:23             ` cotulla
  2013-02-26  6:55               ` Rob Landley
  1 sibling, 1 reply; 32+ messages in thread
From: cotulla @ 2013-02-24 12:23 UTC (permalink / raw)
  To: linux-hexagon

This is our current boot log:

http://pastie.org/private/slkvwbykfu4txqc1bq51q

However, there is a lot of debug output.
But at least it already reaches the end of a user-mode program which consists of:

 puts("Hello from linux");
 while (1);



-Cotulla

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [DISCUSSION] Hexagon code inside kernel
  2013-02-24 12:00             ` cotulla
@ 2013-02-24 16:32               ` Linas Vepstas
  2013-02-24 17:29                 ` cotulla
  0 siblings, 1 reply; 32+ messages in thread
From: Linas Vepstas @ 2013-02-24 16:32 UTC (permalink / raw)
  To: cotulla; +Cc: linux-hexagon

On 24 February 2013 06:00,  <cotulla@yandex.ua> wrote:
> During tests I got around 90 Mbytes/sec RAM access in simple memcpy on 256Mhz, while ARM got around 880 Mbytes/sec.
> Do you think it's normal or maybe something configured wrong?

> memcpy is done via simpliest way:
> Memory is L1WB L2C. I & D prefetchings are enabled.
>
> memcpy:
> {
>     loop0 (memcpy+0x10, r2)
>     p0 = cmp.eq (r2, #0)
>     r3 = r0
> }
> if (p0) jumpr r31
> r2 = memb (r1 ++ #1)
> {
>     nop
>     memb (r3 ++ #1) = r2
> }:endloop0
> jumpr r31

Well.... you are using memb to copy one byte at a time, instead of
memd which will copy 8 bytes at a time.   So that change alone should
multiply performance by 8.

You can almost surely be more clever, and move the if statement and
the increment into a packet, so that they execute in parallel.  It's a
VLIW, it's very parallel, and that needs to be exploited.  For example, (if
I remember correctly (?)) you can set and read r2 in the same packet,
the read will get the old value, the set will write the new value.
You can move the jumpr into the next packet, if you don't mind copying
one extra byte.   So there are many tricks like this.

I don't remember the details any longer, but there are cases where the
compiler would inline some fairly efficient memcpy code.   Also .. the
compiler is smart, in general --  you should try writing the loop in C
(and using 64-bit words) and seeing what it generates.

And finally, aren't there some optimized memcpy assembly routines in
the kernel? I know we kept talking about these...
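
A minimal sketch of the "write the loop in C, using 64-bit words" suggestion
above. The function name copy_words64 is hypothetical (it is not taken from
the kernel or from any Hexagon libc). Whether the compiler turns the
fixed-size 8-byte copies into memd loads and stores is an assumption that
depends on the toolchain and on what it can prove about alignment, so the
generated assembly is worth inspecting.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy n bytes, eight at a time where possible. */
void *copy_words64(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(n == 0 || (dst != NULL && src != NULL));

    /* Bulk part: one 64-bit word per iteration.  The fixed-size memcpy
     * calls avoid alignment and aliasing problems in portable C; a
     * compiler is free to lower each one to a single 64-bit access. */
    while (n >= sizeof(uint64_t)) {
        uint64_t word;
        memcpy(&word, s, sizeof word);
        memcpy(d, &word, sizeof word);
        s += sizeof word;
        d += sizeof word;
        n -= sizeof word;
    }

    /* Tail: the remaining 0..7 bytes, one at a time. */
    while (n--)
        *d++ = *s++;

    return dst;
}
```

Like memcpy, this sketch assumes the source and destination buffers do not
overlap.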

--linas

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [DISCUSSION] Hexagon code inside kernel
  2013-02-24 16:32               ` Linas Vepstas
@ 2013-02-24 17:29                 ` cotulla
  2013-02-24 21:03                   ` Linas Vepstas
  0 siblings, 1 reply; 32+ messages in thread
From: cotulla @ 2013-02-24 17:29 UTC (permalink / raw)
  To: linux-hexagon


> Well.... you are using memb to copy one byte at a time, instead of
> memd which will copy 8 bytes at a time.  So that change alone should
> multiply performance by 8.
>
> You can almost surely be more clever, and move the if statement and
> the increment into a packet, so that they execute in parallel. Its a
> vliw, its very parallel, that needs to be exploited. For example, (if
> I remember correctly (?)) you can set and read r2 in the same packet,
> the read will get the old value, the set will write the new value.
> You can move the jumpr into the next packet, if you don't mind copying
> one extra byte.  So there are many tricks like this.
>
> I don't remember the details any longer, but there are cases where the
> compiler would inline some fairly efficient memcpy code.  Also .. the
> compiler is smart, in general -- you should try writing the loop in C
> (and using 64-bit words) and seeing what it generates.
>
> And finally, aren't there some optimized memcpy assembly routines in
> the kernel? I know we kept talking about these...
>
> --linas

Yes, there is an optimized memcpy version.
But my goal was to "compare" ARM performance with QDSP6, to know what to expect from it.
So I wrote some simple C code and tested it on both processors.

And there is a difference between them. My question was a general one: do you think this is normal, or should the QDSP6 be faster/slower?
Did you test performance inside Linux on Hexagon?


-Cotulla

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [DISCUSSION] Hexagon code inside kernel
  2013-02-24 17:29                 ` cotulla
@ 2013-02-24 21:03                   ` Linas Vepstas
  2013-02-25 17:26                     ` Rob Landley
  0 siblings, 1 reply; 32+ messages in thread
From: Linas Vepstas @ 2013-02-24 21:03 UTC (permalink / raw)
  To: cotulla; +Cc: linux-hexagon

On 24 February 2013 11:29,  <cotulla@yandex.ua> wrote:
>
>>  Well.... you are using memb to copy one byte at a time, instead of
>>  memd which will copy 8 bytes at a time.   So that change alone should
>>  multiply performance by 8.
>>
>>  You can almost surely be more clever, and move the if statement and
>>  the increment into a packet, so that they execute in parallel.  Its a
>>  vliw, its very parallel, that needs to be exploited.  For example, (if
>>  I remember correctly (?)) you can set and read r2 in the same packet,
>>  the read will get the old value, the set will write the new value.
>>  You can move the jumpr into the next packet, if you don't mind copying
>>  one extra byte.   So there are many tricks like this.
>>
>>  I don't remember the details any longer, but there are cases where the
>>  compiler would inline some fairly efficient memcpy code.   Also .. the
>>  compiler is smart, in general --  you should try writing the loop in C
>>  (and using 64-bit words) and seeing what it generates.
>>
>>  And finally, aren't there some optimized memcpy assembly routines in
>>  the kernel? I know we kept talking about these...
>>
>>  --linas
>
> Yes, there is an optimized memcpy version.
> But my goal was to "compare" ARM performance with QDSP6 to know what to wait from it.
> So I made a simple C code and tested it on both processors.

I don't understand what you are trying to do, or what you are
expecting.  The hardware is fundamentally different.  As Rob
explained, the hexagon is effectively clocked at the same rate as the
RAM, so you don't need a fancy bus or any fancy hardware to try to
coalesce reads/writes.   By contrast, the ARM is clocked so fast that
a simple bus design would cause it to stall waiting on RAM. So I'm
assuming that they have a write-coalescing circuit, which collects up
bytes until they reach bus width, and only then pump the bus.

I don't know if the ARM does this, but I've seen cache designs
optimized for Java:  they noticed that java programs set a word to
zero, and then, a few cycles later, write a different value to the
same location. So this circuit doesn't push out the zero right away;
this keeps down the traffic to the cache, lessens bus contention.

I mean, who knows how the ARM is designed ..

> And there is a difference between them. My question was in general: do you think it's normal or QDSP6 should work faster/slower?

Well, I answered that question: if you ask the processor to write one
byte at a time, it will.  But you could ask it to write 8 bytes at a
time, and it will do that 8 times faster.  And if you make the
processor execute several no-ops  between writes, then, yes, your
write performance will decrease.  The VLIW has four slots per cycle.
Your example program used less than two slots per cycle, more than 60%
or 70% of your program consisted of no-ops. (the dis-assembler doesn't
normally print no-ops for empty slots, it just leaves them blank).
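
As a rough check of that estimate, here is a small sketch of the slot
arithmetic. It assumes one packet issues per cycle and counts the inner loop
of the quoted memcpy as two packets holding two useful instructions (one
load and one store; the explicit nop is not counted). The helper name
empty_slot_fraction is hypothetical.

```c
#include <assert.h>

#define SLOTS_PER_PACKET 4u  /* "four slots per cycle", as stated above */

/* Hypothetical helper: fraction of issue slots left empty when
 * `useful_insns` instructions are spread over `packets` packets. */
double empty_slot_fraction(unsigned packets, unsigned useful_insns)
{
    unsigned total_slots = packets * SLOTS_PER_PACKET;

    assert(packets > 0 && useful_insns <= total_slots);
    return 1.0 - (double)useful_insns / (double)total_slots;
}
```

Under those assumptions, empty_slot_fraction(2, 2) is 0.75, i.e. three
quarters of the issue slots in the copy loop do no useful work.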

That said, I don't remember at all what the clock speeds were, or any of
that stuff, so I'm not sure what it should be. It changed between v2,
v3, v4, v5.

> Did you test performance inside Linux on Hexagon?

Yes ... the main result was that it was TLB-starved.  The guys
designing it are performance and watts-per-cycle crazy, they're very
devoted to optimizing this stuff, to getting the most per transistor
possible. It's a very tiny core with very few transistors.  I mean, it's
probably smaller than the ARM register file (OK, I'm just making this
last one up, but I'm guessing it just might be true, I wouldn't be
surprised.).

-- Linas

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [DISCUSSION] Hexagon code inside kernel
  2013-02-24 21:03                   ` Linas Vepstas
@ 2013-02-25 17:26                     ` Rob Landley
  2013-02-26 18:54                       ` cotulla
  0 siblings, 1 reply; 32+ messages in thread
From: Rob Landley @ 2013-02-25 17:26 UTC (permalink / raw)
  To: linasvepstas; +Cc: cotulla, linux-hexagon

On 02/24/2013 03:03:37 PM, Linas Vepstas wrote:
> > Yes, there is an optimized memcpy version.
> > But my goal was to "compare" ARM performance with QDSP6 to know what to wait from it.
> > So I made a simple C code and tested it on both processors.

You're comparing arm performance with QDSP6 by writing pessimal QDSP6  
code that does single-byte moves and keeps half the execution units  
idle. You're going to get some extremely useful numbers out of that,  
aren't you? (Even their uClibc port had an assembly optimized  
memmove().)

Is your arm code also doing single byte moves, with the requisite  
bit-shifting and masking that doing that on arm entails (since last I  
checked arm hasn't actually _got_ instructions that handle bytes,  
although maybe it went into thumb2 or v7 or v8 when I wasn't  
looking...)?

> > Did you test performance inside Linux on Hexagon?
> 
> Yes ... the main result was that it was TLB-starved.  They guys
> designing it are performance and watts-per-cycle crazy, they're very
> devoted to optimizing this stuff, to getting the most per transistor
> possible. Its a very tiny core with very few transistors.  I mean, its
> probably smaller than the ARM register file (OK, I'm just making this
> last one up, but I'm guessing it just might be true, I wouldn't be
> surprised.).

Specifically, the v2 hardware (in the snapdragon chipset in the Nexus  
One) has 6 register profiles (for the 6 pipeline stages, acting as  
6-way SMP) but performance peaked at "make -j 3" which ran very  
slightly faster than "make -j 4", and then -j 5 and -j 6 were each  
noticeably slower (due to TLB thrashing).

I believe that v3 had already taped out by then (late 2010, but it had  
fewer pipeline stages and thus register profiles anyway), and then v4  
was going to increase the TLB entries. What actually shipped was after  
my time, dunno the details.

Rob

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [DISCUSSION] Hexagon code inside kernel
  2013-02-24 12:23             ` cotulla
@ 2013-02-26  6:55               ` Rob Landley
  2013-02-26 19:30                 ` cotulla
  2013-02-26 19:32                 ` cotulla
  0 siblings, 2 replies; 32+ messages in thread
From: Rob Landley @ 2013-02-26  6:55 UTC (permalink / raw)
  To: cotulla; +Cc: linux-hexagon

On 02/24/2013 06:23:42 AM, cotulla@yandex.ua wrote:
> This is our current boot log:
> 
> http://pastie.org/private/slkvwbykfu4txqc1bq51q
> 
> However there are a lot of debug output.
> But at least it already reaches the end of user mode program which  
> consist from:
> 
>  puts("Hello from linux");
>  while (1);

I think I've missed chunks of this conversation: what are you booting  
it on?

(Wondering if I can get a test environment together. I haven't had one  
since I left qualcomm in 2010.)

Rob

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [DISCUSSION] Hexagon code inside kernel
  2013-02-25 17:26                     ` Rob Landley
@ 2013-02-26 18:54                       ` cotulla
  2013-02-27  0:58                         ` Rob Landley
  0 siblings, 1 reply; 32+ messages in thread
From: cotulla @ 2013-02-26 18:54 UTC (permalink / raw)
  To: linux-hexagon


> You're comparing arm performance with QDSP6 by writing pessimal QDSP6
> code that does single-byte moves and keeps half the execution units
> idle. You're going to get some extremely useful numbers out of that,
> aren't you? (Even their uClibc port had an assembly optimized
> memmove().)
Well, isn't it similar to an ordinary C/C++ task and what the compiler produces for it?
At least until you do manual assembler optimization.

> Is your arm code also doing single byte moves, with the requisite
> bit-shifting and masking that doing that on arm entails (since last I
> checked arm hasn't actually _got_ instructions that handle bytes,
> although maybe it went into thumb2 or v7 or v8 when I wasn't
> looking...)?
ARM has had LDRB and STRB instructions for a long time (even in ARMv4).

Okay, it seems this is a really bad test.


> Specifically, the v2 hardware (in the snapdragon chipset in the Nexus
> One) has 6 register profiles (for the 6 pipeline stages, acting as
> 6-way SMP) but performance peaked at "make -j 3" which ran very
> slightly faster than "make -j 4", and then -j 5 and -j 6 were each
> noticeably slower (due to TLB thrashing).
Interesting to know that.
I want to get SSH access, so that I have a console and can interact with the system.


> I believe that v3 had already taped out by then (late 2010, but it had
> fewer pipeline stages and thus register profiles anyway), and then v4
> was going to increase the TLB entries. What actually shipped was after
> my time, dunno the details.
v3 should be rather close to v2, but v4 seems to have a few new features.
At the moment, all the v2 source code works on v3.


-Cotulla

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [DISCUSSION] Hexagon code inside kernel
  2013-02-26  6:55               ` Rob Landley
@ 2013-02-26 19:30                 ` cotulla
  2013-02-26 19:32                 ` cotulla
  1 sibling, 0 replies; 32+ messages in thread
From: cotulla @ 2013-02-26 19:30 UTC (permalink / raw)
  To: linux-hexagon


> I think I've missed chunks of this conversation: what are you booting
> it on?
>
> (Wondering if I can get a test environment together. I haven't had one
> since I left qualcomm in 2010.)
>

At the moment two people have done some work on this project:
0) Cotulla (me) - HTC HD2 (HTC LEO), QSD8250B, QDSP6v2
Image loading is done by the custom MAGLDR bootloader.

1) jonpry - HP TouchPad (APQ8060), QDSP6v3
Image loading is done from the kernel via PIL, AFAIK.
He also tried it on an APQ8064 with QDSP6v4, but had no luck loading unsigned LPASS images.


The current kernel code is in Git here:
https://github.com/detule/linux-hexagon
We started from a clean Linux 3.7.6 tree.


I am trying to bring up the USB gadget driver now.
I have already found that arm/mach-msm is also nonsense. In general it's bad, ugly code for something mythical.
I tried to use the built-in chipidea driver, but it looks extremely bloated and doesn't really fit the LEO hardware.
I think I will try to port the old msm7k_udc.c driver from 2.6.35. It's much simpler.


Also, I found that adding "volatile" to a variable declaration keeps accesses to it from being grouped into packets.
So it can be an easy workaround for the uncached memory access problem, instead of replacing everything with ioread32 and iowrite32.


-Cotulla

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [DISCUSSION] Hexagon code inside kernel
  2013-02-26  6:55               ` Rob Landley
  2013-02-26 19:30                 ` cotulla
@ 2013-02-26 19:32                 ` cotulla
  2013-02-26 19:59                   ` Linas Vepstas
  2013-02-27  1:06                   ` Rob Landley
  1 sibling, 2 replies; 32+ messages in thread
From: cotulla @ 2013-02-26 19:32 UTC (permalink / raw)
  To: linux-hexagon


> I think I've missed chunks of this conversation: what are you booting
> it on?
>
> (Wondering if I can get a test environment together. I haven't had one
> since I left qualcomm in 2010.)
>

At the moment two people have done some work on this project:
0) Cotulla (me) - HTC HD2 (HTC LEO), QSD8250B, QDSP6v2
Image loading is done by the custom MAGLDR bootloader.

1) jonpry - HP TouchPad (APQ8060), QDSP6v3
Image loading is done from the kernel via PIL, AFAIK.
He also tried it on an APQ8064 with QDSP6v4, but had no luck loading unsigned LPASS images.


The current kernel code is in Git here:
https://github.com/detule/linux-hexagon
We started from a clean Linux 3.7.6 tree.


I am trying to bring up the USB gadget driver now.
I have already found that arm/mach-msm is also nonsense. In general it's bad, ugly code for something mythical.
I tried to use the built-in chipidea driver, but it looks extremely bloated and doesn't really fit the LEO hardware.
I think I will try to port the old msm7k_udc.c driver from 2.6.35. It's much simpler.


Also, I found that adding "volatile" to a variable declaration keeps accesses to it from being grouped into packets.
So it can be an easy workaround for the uncached memory access problem, instead of replacing everything with ioread32 and iowrite32.


-Cotulla

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [DISCUSSION] Hexagon code inside kernel
  2013-02-26 19:32                 ` cotulla
@ 2013-02-26 19:59                   ` Linas Vepstas
  2013-02-26 20:25                     ` cotulla
  2013-02-27  1:06                   ` Rob Landley
  1 sibling, 1 reply; 32+ messages in thread
From: Linas Vepstas @ 2013-02-26 19:59 UTC (permalink / raw)
  To: cotulla; +Cc: linux-hexagon

On 26 February 2013 13:32,  <cotulla@yandex.ua> wrote:
> I am trying to bring up USB garget driver now.
> Already found that arm/mach-msm is also nosense. In common it's bad, ugly code for something mythic.

Yeah.

One of the worst design errors in that code is that base addresses
were made into compile-time constants, and then, to add serious injury
to insult, a lot of complicated macros were written to obtain various
relative offsets.  The problem being that the hexagon uses exactly the
same offsets, but just different base addresses -- and they're
run-time configurable, e.g. by bootloader or by device tree.  Device
tree support for hexagon was planned, not sure about ARM.

I had started work on this, but I don't remember that I ever got it
clean enough to submit upstream.  That plus the fact that interrupt
routing was different:  hexagon has both second and third-level
interrupt controllers, with ethernet being on a third-level
controller.  I'd gotten much/most of the 2nd-level code done, and some
of the 3rd-level code done, and I think it's in the mainline kernel
now, although I remember planning to do some more simplification and
cleanup.

> Also I found that adding "volatile" to variable declaration puts all accesses to it as not grouped into packets.

Yes, that sounds right.  Sorry, forgot about that.

> So it can be easy workaround for uncached memory access problem, instead of replacing everything to ioread32 and iowrite32.

Hmm. Either that or volatile needs to be added to the upstream
drivers: this is an issue for a variety of arches, not just hexagon.
Certainly powerpc treats uncached and i/o access in a very
different way than memory access, and if you don't do it right, life
will be painful.
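
To make the volatile-versus-accessor point concrete, here is a minimal
userspace sketch. The names reg_read32 and reg_write32 are hypothetical
stand-ins, not the kernel's ioread32/iowrite32. In portable C, volatile only
guarantees that the compiler performs each access and keeps volatile
accesses in program order; it does not by itself add the barriers or
architecture-specific handling that real I/O accessors may provide. The
effect on Hexagon packet grouping is the observation reported above, not
something this sketch demonstrates.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical accessor: exactly one 32-bit load from a register address. */
static inline uint32_t reg_read32(const volatile void *addr)
{
    assert(((uintptr_t)addr & 3u) == 0);  /* registers are word aligned */
    return *(const volatile uint32_t *)addr;
}

/* Hypothetical accessor: exactly one 32-bit store to a register address.
 * Argument order (value, address) mirrors the usual kernel convention. */
static inline void reg_write32(uint32_t value, volatile void *addr)
{
    assert(((uintptr_t)addr & 3u) == 0);
    *(volatile uint32_t *)addr = value;
}
```

Funnelling every device access through two small functions like these keeps
the architecture-specific part in one place, which is the same reason
drivers are normally asked to use the I/O accessors rather than sprinkle
volatile over their own variables.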

--linas

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [DISCUSSION] Hexagon code inside kernel
  2013-02-26 19:59                   ` Linas Vepstas
@ 2013-02-26 20:25                     ` cotulla
  2013-02-26 20:57                       ` Linas Vepstas
  0 siblings, 1 reply; 32+ messages in thread
From: cotulla @ 2013-02-26 20:25 UTC (permalink / raw)
  To: linux-hexagon

>
> One of the worst design errors in that code is that base addresses
> were made into compile-time constants, and then, to add serious injury
> to insult, a lot of complicated macros were written to obtain various
> relative offsets. The problem being that the hexagon uses exactly the
> same offsets, but just different base addresses -- and they're
> run-time configurable, e.g. by bootloader or by device tree. Device
> tree support for hexagon was planned, not sure about ARM.
>
For me it's normal to have hardcoded base addresses; it's not so bad.
I am talking about more general problems, like useless code or code without enough data checks, in my opinion.


Hexagon has the same addresses for most shared devices, except the private ones such as timers and the second-level interrupt controller.
For example, I am using exactly the same base address for USB and it works.

> I had started work on this, but I don't remember that I ever got it
> clean enough to submit upstream. That plus the fact that interrupt
> routing was different: hexagon has both second and third-level
> interrupt controllers, with ethernet being on a third-level
> controller. I'd gotten much/most of the 2nd-level code done, and some
> of the 3rd-level code done, and I think its in the mainline kernel
> now, although I remember planning to do some more simplification and
> cleanup.
Hm, it seems to depend on the SoC.
In the QSD8250B there are two second-level interrupt controllers (SIRC).
In total it can handle (32 - 2) + 2 * 32 = 94 interrupts.
I already got the SIRC to work and receive USB interrupts without problems.
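
The interrupt count above can be written out as a small sketch. It assumes
that the "32 - 2" term means each second-level controller uses up one
first-level line as its cascade input; that reading, and the helper name
total_irq_lines, are assumptions rather than something taken from the SoC
documentation.

```c
#include <assert.h>

/* Hypothetical helper: number of usable interrupt sources when each of
 * `sirc_count` second-level controllers is cascaded into one line of a
 * first-level controller with `first_level_lines` inputs. */
unsigned total_irq_lines(unsigned first_level_lines,
                         unsigned sirc_count,
                         unsigned lines_per_sirc)
{
    assert(sirc_count <= first_level_lines);

    /* First-level lines left over after the cascades, plus every line
     * provided by the second-level controllers. */
    return (first_level_lines - sirc_count) + sirc_count * lines_per_sirc;
}
```

With 32 first-level lines and two 32-line second-level controllers,
total_irq_lines(32, 2, 32) gives (32 - 2) + 2 * 32 = 94.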

> Yes, that sounds right. Sorry, forgot about that.
> Hmm. Either that or volatile needs to be added to the upstream
> drivers: this is an issue for a variety of arches, not just hexagon.
> Certainly powerpc treats treats uncached and i/o access in a very
> different way than memory access, and if you don't do it right, life
> will be painful.
Probably. It's just something that needs to be watched extremely carefully during low-level development for Hexagon; I have already been "caught" by that rule a few times :-P


-Cotulla

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [DISCUSSION] Hexagon code inside kernel
  2013-02-26 20:25                     ` cotulla
@ 2013-02-26 20:57                       ` Linas Vepstas
  0 siblings, 0 replies; 32+ messages in thread
From: Linas Vepstas @ 2013-02-26 20:57 UTC (permalink / raw)
  To: cotulla; +Cc: linux-hexagon

On 26 February 2013 14:25,  <cotulla@yandex.ua> wrote:

> Hexagon has same addresses for most shared devices except the private ones - like timers and second level interrupt controller.
> For example, I am using exactly same base address for USB and it works.

Hmm. That's odd. Certainly, the formal specs for the chipset put them
in different places, but this all depends on how you set up various
registers during boot.  Note, FYI, that among other things there
are registers that can make devices and address ranges
appear/disappear for the various different cpus -- this was meant as
both a security and convenience feature, so that e.g. the arm and
hexagon would not both try to use the same device at the same time, or
that e.g. the radio modem could be hidden.

Again, the whole point of the device tree is to allow base addresses
to be boot-time configured, so that a given driver could work on a
variety of different systems.
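
A minimal sketch of that idea: the register offsets stay compile-time
constants shared by every CPU, while the base address is carried in a
per-device structure filled in at boot (from a device tree or the
bootloader). The structure, the offset and the helper name are all
hypothetical and do not correspond to the real MSM register layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register offset, identical on every CPU that sees the block. */
#define DEV_REG_STATUS 0x08u

/* Per-device state: the base address is data discovered at boot,
 * not a compile-time constant baked into macros. */
struct dev_regs {
    uintptr_t base;
};

/* Address of a register is always "runtime base + fixed offset". */
static inline uintptr_t dev_reg_addr(const struct dev_regs *dev,
                                     uint32_t offset)
{
    assert(dev != NULL);
    return dev->base + offset;
}
```

The same driver code can then serve two CPUs that map the block at different
bases, for example by initialising one struct dev_regs per view of the
device.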

>>  I had started work on this, but I don't remember that I ever got it
>>  clean enough to submit upstream.  That plus the fact that interrupt
>>  routing was different:  hexagon has both second and third-level
>>  interrupt controllers, with ethernet being on a third-level
>>  controller.  I'd gotten much/most of the 2nd-level code done, and some
>>  of the 3rd-level code done, and I think its in the mainline kernel
>>  now, although I remember planning to do some more simplification and
>>  cleanup.
> Hm it seems dependent from SoC.

Yes, very much so.  And there are many different SoC's.

> In QDS8250B there are two second level interrupt controllers (SIRC)
> In common it can handle (32 - 2) + 2 * 32 interrupts.
> I already got SIRC to work and receive USB interrupts without problems.

Yep. And some of the SIRC's are routed to several third-level
controllers.  I vaguely remember there being 168 interrupts total, or
something like that.

And, BTW, there's some craziness, where different devices can be
multiplexed onto a given interrupt line; it's really crazy, but
everything that you would normally be interested in was default.  I
don't remember what the other crazy stuff was  ..video controllers or
dma engines or something like that.

--linas

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [DISCUSSION] Hexagon code inside kernel
  2013-02-26 18:54                       ` cotulla
@ 2013-02-27  0:58                         ` Rob Landley
  2013-02-27 12:39                           ` cotulla
  0 siblings, 1 reply; 32+ messages in thread
From: Rob Landley @ 2013-02-27  0:58 UTC (permalink / raw)
  To: cotulla; +Cc: linux-hexagon

On 02/26/2013 12:54:21 PM, cotulla@yandex.ua wrote:
> 
> >  You're comparing arm performance with QDSP6 by writing pessimal QDSP6
> >  code that does single-byte moves and keeps half the execution units
> >  idle. You're going to get some extremely useful numbers out of that,
> >  aren't you? (Even their uClibc port had an assembly optimized
> >  memmove().)
> Well, is it simular to usual C/C++ code task and results of compilation?

Not for strcpy(), strcat(), memcpy(), memmove(), structure assignment...

> Until you will do a manual assembler optimization.

No, until your libc does the manual assembler optimization of common  
operations _for_ you (which is half the reason there's target-specific  
code), and you use the right functions.

> >  Is your arm code also doing single byte moves, with the requisite
> >  bit-shifting and masking that doing that on arm entails (since last I
> >  checked arm hasn't actually _got_ instructions that handle bytes,
> >  although maybe it went into thumb2 or v7 or v8 when I wasn't
> >  looking...)?
> ARM has LDRB and STRB instructions long time ago (ever in ARMv4)
> 
> Okay, seems this is really bad test.

That was my point, yes.

I honestly dunno how good the hexagon optimizer in gcc is. (The  
qualcomm guys did extensive hand-hacked assembly, platform hasn't been  
all that widely used outside of there.) We've gone a touch beyond  
"duff's device" these days...

> >  Specifically, the v2 hardware (in the snapdragon chipset in the Nexus
> >  One) has 6 register profiles (for the 6 pipeline stages, acting as
> >  6-way SMP) but performance peaked at "make -j 3" which ran very
> >  slightly faster than "make -j 4", and then -j 5 and -j 6 were each
> >  noticeably slower (due to TLB thrashing).
> Intersting to know that.
> I want to get SSH access to got console and interaction with system.

I miss having that myself. I built Aboriginal Linux on target and built  
Linux From Scratch and chunks of Beyond Linux From Scratch under it and  
the result was a fairly decent machine. I would love to support it in  
my upstream vanilla projects, but Qualcomm never gave me (or anyone  
else) the tools to do that outside of employee context.

> >  I believe that v3 had already taped out by then (late 2010, but it had
> >  fewer pipeline stages and thus register profiles anyway), and then v4
> >  was going to increase the TLB entries. What actually shipped was after
> >  my time, dunno the details.
> v3 should be rather close to v2, but v4 seems to have few new features.
> At the current moment all v2 source code works on v3.

They're backwards but not forwards compatible. v2 works on v3 and v4,  
but v4 may not work on v2. (Most architectures work that way.)

Over in the arm world, moving from armv4l to armv5l gives you about a  
25% performance boost (which means 25% more battery life if it races to
idle faster), and the Arm EABI standard requires thumb instructions
(armv4tl). But otherwise, you can run the old code just fine.

Rob

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [DISCUSSION] Hexagon code inside kernel
  2013-02-26 19:32                 ` cotulla
  2013-02-26 19:59                   ` Linas Vepstas
@ 2013-02-27  1:06                   ` Rob Landley
  2013-02-27  1:30                     ` Linas Vepstas
  1 sibling, 1 reply; 32+ messages in thread
From: Rob Landley @ 2013-02-27  1:06 UTC (permalink / raw)
  To: cotulla; +Cc: linux-hexagon

On 02/26/2013 01:32:40 PM, cotulla@yandex.ua wrote:
> 
> >  I think I've missed chunks of this conversation: what are you booting
> >  it on?
> >
> >  (Wondering if I can get a test environment together. I haven't had one
> >  since I left qualcomm in 2010.)
> >
> 
> Actually, at the moment two people have done some work on this
> project:
> 0) Cotulla (me) - HTC HD2 (HTC LEO) QSD8250B QDSP6v2
> The image is loaded by the custom MAGLDR bootloader.
> 
> 1) jonpry - HP TouchPad (APQ8060) QDSP6v3
> The image is loaded from the kernel via PIL, AFAIK.
> He also tried it on APQ8064 with QDSP6v4, but had no luck loading
> unsigned LPASS images.
> 
> 
> We started from a clean Linux 3.7.6 tree and worked from there.
> The current kernel code is in Git here:
> https://github.com/detule/linux-hexagon

Cool. I haven't got hardware to do anything with it at the moment.  
(With the possible exception of my phone, but it's in use as a phone.  
Haven't even put cyanogenmod on it.)

> I am trying to bring up the USB gadget driver now.
> I already found that arm/mach-msm is also nonsense. In general it's bad,
> ugly code for something mythical.

If you poke Thomas Gleixner, he _might_ be able to help. (He did work  
under NDA, but some of the restrictions may have been relaxed when the  
code went upstream. If he can't help, he'll say so.)

You won't get a lot of time out of him, but he may be able to answer  
questions about where code lives in various public trees...

> I was trying to use the built-in chipidea driver, but it looks
> extremely overloaded and doesn't really fit the LEO hardware.
> I think I will try to port the old msm7k_udc.c driver from 2.6.35. It's
> much simpler.

I really want qemu support for this. Sigh. Are there public assembly  
docs anywhere? (I had a giant printout brick circa 2010 but remember  
almost nothing from it and gave it back when I left...)

> Also, I found that adding "volatile" to a variable declaration keeps
> all accesses to it
> from being grouped into packets.

What I _do_ remember is that each instruction is 32 bits, and that one  
of those bits is the grouping bit which is either on or off (I forget  
which) to indicate the end of a packet.

If you build -O0 it sets the bit on every instruction, _and_ I think  
feeds in NOPS to position instructions to go to the "right" cores when  
you need to do something the first or second core can't do. (I think #3  
and #4 are identical, but could easily be wrong.)
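
To make the packet idea concrete, here is a minimal sketch of splitting a stream of 32-bit instruction words into packets on an end-of-packet marker. The marker's position and value are assumptions (a two-bit field at bits 15:14, with 0b11 meaning "last instruction in the packet"); check them against the programmer's reference before relying on them. The function names are made up for illustration.

```python
# Assumed encoding, not confirmed in this thread: bits 15:14 of each 32-bit
# word form a "parse" field, and the value 0b11 ends the current packet.
PARSE_SHIFT = 14
PARSE_MASK = 0b11
END_OF_PACKET = 0b11


def is_end_of_packet(word):
    """Return True if this instruction word closes a packet (assumed encoding)."""
    return ((word >> PARSE_SHIFT) & PARSE_MASK) == END_OF_PACKET


def split_packets(words):
    """Group instruction words into packets, in program order."""
    packets = []
    current = []
    for word in words:
        current.append(word)
        if is_end_of_packet(word):
            packets.append(current)
            current = []
    if current:
        # The stream ended mid-packet (truncated input); keep what we have.
        packets.append(current)
    return packets
```

In this model, code built at -O0 would have the marker set on every word, so split_packets() would return one single-instruction packet per word.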

> So it can be an easy workaround for the uncached memory access problem,
> instead of replacing
> everything with ioread32 and iowrite32.

Oh there's a CPU cache. The problem is the instruction dispatch is a  
fraction of the clock speed, and the clock speed's low to begin with  
(both to keep power consumption down). The speed comes entirely from  
parallelism, you really need to take advantage of that on this chip or  
performance is going to suck.

Rob

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [DISCUSSION] Hexagon code inside kernel
  2013-02-27  1:06                   ` Rob Landley
@ 2013-02-27  1:30                     ` Linas Vepstas
  2013-02-27  3:03                       ` Rob Landley
  0 siblings, 1 reply; 32+ messages in thread
From: Linas Vepstas @ 2013-02-27  1:30 UTC (permalink / raw)
  To: Rob Landley; +Cc: cotulla, linux-hexagon

On 26 February 2013 19:06, Rob Landley <rob@landley.net> wrote:
> Cool. I haven't got hardware to do anything with it at the moment.

Used cellphones are amazingly cheap.

> I really want qemu support for this. Sigh. Are there public assembly docs
> anywhere?

Yes, google for them, they're on codeaurora somewhere. They've been
there since forever, actually.

-- Linas

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [DISCUSSION] Hexagon code inside kernel
  2013-02-27  1:30                     ` Linas Vepstas
@ 2013-02-27  3:03                       ` Rob Landley
  2013-02-27 12:35                         ` cotulla
  0 siblings, 1 reply; 32+ messages in thread
From: Rob Landley @ 2013-02-27  3:03 UTC (permalink / raw)
  To: linasvepstas; +Cc: cotulla, linux-hexagon

On 02/26/2013 07:30:23 PM, Linas Vepstas wrote:
> On 26 February 2013 19:06, Rob Landley <rob@landley.net> wrote:
> > Cool. I haven't got hardware to do anything with it at the moment.
> 
> Used cellphones are amazingly cheap.

What would be some good models to try? (I dunno what has v3 or v4 in  
it, not everything's got a qualcomm chipset...)

Rob

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [DISCUSSION] Hexagon code inside kernel
  2013-02-27  3:03                       ` Rob Landley
@ 2013-02-27 12:35                         ` cotulla
  0 siblings, 0 replies; 32+ messages in thread
From: cotulla @ 2013-02-27 12:35 UTC (permalink / raw)
  To: linux-hexagon


> What would be some good models to try? (I dunno what has v3 or v4 in
> it, not everything's got a qualcomm chipset...)
>

It's hard to advise.
MSM8660/APQ8060 contain V3;
MSM8960/APQ8064 contain V4;
The device also must be unlocked to run unsigned code. Most modern phones are locked, and it's rather hard to unlock them.
Note that some SoCs contain QDSP5, not QDSP6.
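
The chips named so far in this thread can be kept as a small lookup table when picking a test device. This is only a sketch: the table holds just the pairings mentioned here (QSD8250B with QDSP6v2 and the two lines above), and the names QDSP6_VERSION_BY_SOC and qdsp6_version are made up for illustration.

```python
# SoC name -> QDSP6 core version, as stated in this thread only.
QDSP6_VERSION_BY_SOC = {
    "QSD8250B": 2,  # HTC HD2 (HTC LEO)
    "MSM8660": 3,
    "APQ8060": 3,   # HP TouchPad
    "MSM8960": 4,
    "APQ8064": 4,
}


def qdsp6_version(soc):
    """Return the QDSP6 version for a known SoC, or None if it is not listed.

    None does not mean "no DSP": the SoC may carry a QDSP5, or simply not
    have been mentioned in this thread.
    """
    return QDSP6_VERSION_BY_SOC.get(soc.strip().upper())
```

A lookup says nothing about whether a given phone is unlocked, so it only narrows the search.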


> What I _do_ remember is that each instruction is 32 bits, and that one
> of those bits is the grouping bit which is either on or off (I forget
> which) to indicate the end of a packet.

Yeah, the same thing makes it hard to disassemble: it's hard to know where a packet starts. And program flow only works on packet boundaries.


>
> I really want qemu support for this. Sigh. Are there public assembly
> docs anywhere? (I had a giant printout brick circa 2010 but remember
> almost nothing from it and gave it back when I left...)

You can google for:
80-NB419-1_A_hexagon_v2_programmers_ref.pdf
80-NB419-2_A_hexagon_abi_spec.pdf


>
> If you poke Thomas Gleixner, he _might_ be able to help. (He did work
> under NDA, but some of the restrictions may have been relaxed when the
> code went upstream. If he can't help, he'll say so.)
>
> You won't get a lot of time out of him, but he may be able to answer
> questions about where code lives in various public trees...
>
I think it's just a question of work ;)
It just needs some time to make it work.

Actually, my device doesn't fully support USB OTG: the ID pin is not wired up properly, and there is no internal VBUS.
So I need to change it so that I can specify UDC or UHC mode manually.


-Cotulla

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [DISCUSSION] Hexagon code inside kernel
  2013-02-27  0:58                         ` Rob Landley
@ 2013-02-27 12:39                           ` cotulla
  0 siblings, 0 replies; 32+ messages in thread
From: cotulla @ 2013-02-27 12:39 UTC (permalink / raw)
  To: linux-hexagon


> I miss having that myself. I built Aboriginal Linux on target and built
> Linux From Scratch and chunks of Beyond Linux From Scratch under it and
> the result was a fairly decent machine. I would love to support it in
> my upstream vanilla projects, but Qualcomm never gave me (or anyone
> else) the tools to do that outside of employee context.

I want to get a USB Ethernet connection; it looks suitable for remote access.
Another approach is to get Android's ADB running.

 
> They're backwards but not forwards compatible. v2 works on v3 and v4,
> but v4 may not work on v2. (Most architectures work that way.)
>
Well, v3 claims to have a different ABI, so even userland files are not compatible.
I am talking about low-level system definitions at the source level.
v4 has changes in them (new TLB format, etc.).


-Cotulla

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [DISCUSSION] Hexagon code inside kernel
@ 2013-02-24  0:24 Linas Vepstas
  0 siblings, 0 replies; 32+ messages in thread
From: Linas Vepstas @ 2013-02-24  0:24 UTC (permalink / raw)
  To: linux-hexagon, cotulla

On 19 February 2013 19:17, <cotulla@yandex.ua> wrote:
>
> Does every thread start executing the exception handler, or only one,
> selected by hardware?
>
Only one, I believe. But it's 'random' as to which one.

> How is preemption disabled in that case?
>
? I believe interrupts are disabled upon entry to the exception handler.
You have to re-enable them when you are ready, with (I'm guessing) ciad;
I don't remember how this worked. Each thread has its own local mask too.

There are like 4 or 5 or 6 bits that need to be set/cleared for interrupts
to be handled: there's a global bit to enable any interrupt. There's a
global mask to enable interrupts of a given priority, there's a local,
per-thread mask of the same.  There's an exception bit which is cleared by
return-from-exception.   I think there's one more bit, I can't remember.
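
As a rough model of the conditions listed above, the check below treats an interrupt as deliverable only when every gate is open. It is a sketch, not the hardware's actual logic: the function and parameter names are made up, a set mask bit is assumed to mean "enabled", and the possible extra bit that I can't remember is left out.

```python
def can_take_interrupt(priority, global_enable, global_mask, thread_mask, in_exception):
    """Return True if an interrupt at `priority` would reach this thread.

    global_enable: the single global "any interrupt allowed" bit
    global_mask:   global per-priority enable bits (bit set = enabled, assumed)
    thread_mask:   this thread's local per-priority enable bits (same polarity)
    in_exception:  the exception bit, cleared by return-from-exception
    """
    bit = 1 << priority
    return bool(
        global_enable
        and (global_mask & bit)
        and (thread_mask & bit)
        and not in_exception
    )
```

For example, a thread that is still inside its exception handler gets False for every priority until the exception bit is cleared, no matter how the masks are set.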

>
> 2. What is the difference between ciad and cswi?
> I guess ciad is "Clear Interrupt Ask Done" and cswi is "Clear SoftWare
> Interrupt"?
>

There is a 'software interrupt' which allows one core to interrupt another.
 I think  cswi would be to clear those.

--linas

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2013-02-27 12:39 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-02-15 14:28 [DISCUSSION] Hexagon code inside kernel cotulla
     [not found] ` <CAHrUA364XES66kXhr0Gg1dh_MQBAS0+R8Q4x+EY3dgz6s=QRww@mail.gmail.com>
2013-02-15 22:33   ` Linas Vepstas
2013-02-16  1:35     ` cotulla
2013-02-16  2:34       ` Linas Vepstas
2013-02-16 12:39         ` cotulla
2013-02-16 17:33           ` Linas Vepstas
2013-02-16 19:21             ` cotulla
2013-02-19  4:36           ` rkuo
2013-02-19 14:29             ` Linas Vepstas
2013-02-20  1:07               ` cotulla
2013-02-20  1:17             ` cotulla
2013-02-23  4:24           ` Rob Landley
2013-02-24 12:00             ` cotulla
2013-02-24 16:32               ` Linas Vepstas
2013-02-24 17:29                 ` cotulla
2013-02-24 21:03                   ` Linas Vepstas
2013-02-25 17:26                     ` Rob Landley
2013-02-26 18:54                       ` cotulla
2013-02-27  0:58                         ` Rob Landley
2013-02-27 12:39                           ` cotulla
2013-02-24 12:23             ` cotulla
2013-02-26  6:55               ` Rob Landley
2013-02-26 19:30                 ` cotulla
2013-02-26 19:32                 ` cotulla
2013-02-26 19:59                   ` Linas Vepstas
2013-02-26 20:25                     ` cotulla
2013-02-26 20:57                       ` Linas Vepstas
2013-02-27  1:06                   ` Rob Landley
2013-02-27  1:30                     ` Linas Vepstas
2013-02-27  3:03                       ` Rob Landley
2013-02-27 12:35                         ` cotulla
2013-02-24  0:24 Linas Vepstas
