

APU and Linux


12 replies to this topic

#1 yoshio_kashiwagi

    Newbie

  • Members
  • Pip
  • 2 posts
  • Location:Japan

Posted 14 August 2006 - 05:37 AM

Hi,

I am developing with CoDeveloper and Linux, but the APU interface does not work under Linux (not uClinux). I have confirmed that the same circuit works correctly as an EDK standalone application.
To investigate the cause, I created a simple circuit (HelloWorld) with CoDeveloper, but it does not work under Linux either. However, the apu_loadstore circuit from the following application note works correctly under Linux.

http://www.xilinx.com/bvdocs/appnotes/xapp717.pdf

Since the apu_loadstore circuit responds correctly to APU instructions under Linux, my software environment should not be at fault. I will continue to investigate this problem.

Thanks,

Yoshio Kashiwagi

#2 Joshua

    Newbie

  • Members
  • Pip
  • 1 posts

Posted 20 October 2006 - 06:35 AM

Gidday there,

I am having similar problems with Linux 2.4.26 on the Virtex-4. We have a logic element connected to the APU that works fine when run in our boot loader (PPC-Boot Lite) but after setting the APU bit in the MSR while in Linux, a call to stwfcmx causes the processor to hang. We've tried setting some of the APU config register bits in the DCR, but nothing has an effect.

Joshua

#3 RalphBodenner

    Advanced Member

  • Admin
  • PipPipPip
  • 348 posts

Posted 24 October 2006 - 01:26 PM

Several customers have reported this problem; we're investigating it currently and will report back here when we have more information.

Regards,
Ralph
Ralph Bodenner
Impulse Accelerated Technologies, Inc.

#4 etrexel

    Advanced Member

  • Impulse Staff
  • PipPipPip
  • 260 posts

Posted 02 January 2007 - 10:14 AM

Hi,
This issue is not completely solved, but I thought I'd post some notes in the hope that they prove useful to someone in the meantime.
Investigation so far, using the Adder example under a Linux 2.4.26 kernel, has:
- Discovered a bug in the APU interface that caused a read from an empty stream's status register to lock up the CPU. This has been fixed, and the fix is available in CoDeveloper version 2.20.d3 and newer at http://www.impulse-support.com/ReleaseFiles/

- Shown that the first stream mapped into the APU interface MUST be outgoing. See co_init.c:co_initialize() and make sure that the first call to co_stream_attach() looks like:
co_stream_attach(<out stream name>,0,HW_OUTPUT);

- Shown the following behavior (not ideal, but it is consistent):
  - The first data written to the stream makes it through to the hardware process
  - The first data coming back from the hardware process is always lost

Notes of interest:
- The PowerPC assembly instructions for accessing the APU, 'stwfcmx' and 'lwfcmx' under Xilinx EDK, are the same as 'stvewx' and 'lvewx', respectively, under GCC, which uses their original names. These instructions are not privileged; however, the MSR register must have the APU Enable bit set. To set the MSR correctly in the Linux kernel, please see the attached patch file. To date, this is the only change necessary to the Linux kernel.

I will post any more findings, and ultimately a solution, as they become available. I also would like to encourage anyone else who may be looking into this to please post anything they may have as well.

Thanks,
Ed

Attached File(s)


Ed Trexel
Impulse Accelerated Technologies, Inc.

#5 kerp

    Member

  • Members
  • PipPip
  • 4 posts

Posted 18 May 2007 - 01:34 AM

Hi Ed,

I'm also interested in using the APU from Linux. I currently have a 2.6.20 vanilla kernel taken from kernel.org with a couple of patches that allow it to run on the Avnet FX12 MM board, supporting a few peripherals.

I understand that some changes must be made to the kernel entry-point code for our architecture (head_4xx.S) in order to enable APU functionality during initialization. However, before doing this, I was wondering why the opcode that you mentioned in your last post is not supported by my toolchain. What version of the binutils package are you using? Do you know whether the changes that Xilinx introduced into the GNU assembler (GNU binutils package) to support its custom APU controller instructions are available for download? Or are they documented in a way that describes the mapping between these custom APU controller instructions and the default instructions implemented in the GNU assembler? Is this extended ISA described in a document (other than http://www.xilinx.com/ise/embedded/ppc405_isaext_guide.pdf)?

I'm currently using toolchains with two different versions of binutils, and neither of them supports the stvewx or lvewx instruction, as you can see below:


coquitoVM:~# powerpc-linux-uclibc-as -v
GNU assembler version 2.17 (powerpc-linux-uclibc) using BFD version 2.17

coquitoVM:~# echo "stvewx" | powerpc-linux-uclibc-as -m405
{standard input}: Assembler messages:
{standard input}:1: Error: Unrecognized opcode: `stvewx'

coquitoVM:~# echo "stvewx" | powerpc-linux-uclibc-as
{standard input}: Assembler messages:
{standard input}:1: Error: Unrecognized opcode: `stvewx'



coquitoVM:~# ppc_4xx-as -v
GNU assembler version 2.16.1 (powerpc-linux) using BFD version 2.16.1

coquitoVM:~# echo "stvewx" | ppc_4xx-as -m405
{standard input}: Assembler messages:
{standard input}:1: Error: Unrecognized opcode: `stvewx'

coquitoVM:~# echo "stvewx" | ppc_4xx-as
{standard input}: Assembler messages:
{standard input}:1: Error: Unrecognized opcode: `stvewx'


Thanks,

Marcos
Fudepan ORG

#6 etrexel

    Advanced Member

  • Impulse Staff
  • PipPipPip
  • 260 posts

Posted 18 May 2007 - 10:51 AM

Hi,
The toolchain I've been using outside of the Xilinx-provided tools under EDK is crosstool-0.42, which built the assembler:
$ powerpc-405-linux-gnu-as -v
GNU assembler version 2.15 (powerpc-405-linux-gnu) using BFD version 2.15

Sorry, I don't have all the details of how to get there, but I believe the option "--enable-altivec" is used during the build of the assembler, and "-maltivec" is a command-line option for 'as'. I tried your example, which gave me the same result:
$ echo "stvewx" | powerpc-405-linux-gnu-as
{standard input}: Assembler messages:
{standard input}:1: Error: Unrecognized opcode: `stvewx'

until I did:

$ echo "stvewx" | powerpc-405-linux-gnu-as -maltivec
{standard input}: Assembler messages:
{standard input}:1: Error: missing operand
{standard input}:1: Error: missing operand
{standard input}:1: Error: missing operand

Lastly, a note on the use of the APU interface: the interface doesn't always handle instruction preemption (flushing) well. Care must be taken to avoid interrupts while accessing the APU; so far, the best way is to use ISRs (assuming they are prevented from nesting).

Hope that helps,
Ed
Ed Trexel
Impulse Accelerated Technologies, Inc.

#7 kerp

    Member

  • Members
  • PipPip
  • 4 posts

Posted 06 June 2007 - 04:25 AM

Hi Ed,

More news about this. Using your options, I was able to resolve all the GNU assembler issues. After that, enabling AltiVec support in the kernel made my "adder" example code compile and run.

Now I am facing other problems, apparently due to some hardware misbehavior:

- As you said before, the first data written to the stream makes it through to the hardware process, and the first data coming back from the hardware process is always lost.

- Because I need to execute the binary multiple times, I have to comment out HW_STREAM_CLOSE and use HW_STREAM_READ_NB in the Consumer function. This brings a new problem: during each execution, the last data coming back from the hardware process is always the same (please find attached the logs that describe the issue). I forgot to mention that the stream depths in the hardware process are always 8.

- To complicate the situation a little more, I decided to use floating point in my example. Apparently, after a few (four?) reads executed by the Consumer process, the data in the APU is not well rounded. Is this a problem with the binary floating-point arithmetic of the FPU used? Which FPU is used for the floating-point examples?


Thanks a lot!

Marcos
Fudepan

Attached File(s)



#8 etrexel

    Advanced Member

  • Impulse Staff
  • PipPipPip
  • 260 posts

Posted 06 June 2007 - 10:16 AM

Hi Marcos,
QUOTE
- As you said before, the first data written to the stream makes it through to the hardware process, and the first data coming back from the hardware process is always lost.
What version of CoDeveloper are you running?
The original cause of this was the APU interface mistakenly intercepting FPU accesses; we believe the Linux kernel was scanning to see whether an FPU is present. This was fixed in 2.20.f.x. However, the APU interface is still unable to fully handle instruction flushes (due to interrupts, context switching, etc.), which may lead to lost data during reads or to repeated data during writes.

QUOTE
- Because I need to execute the binary multiple times, I have to comment out HW_STREAM_CLOSE and use HW_STREAM_READ_NB in the Consumer function. This brings a new problem: during each execution, the last data coming back from the hardware process is always the same (please find attached the logs that describe the issue). I forgot to mention that the stream depths in the hardware process are always 8.

Please note that in Consumer(), the 'err' in the call:
HW_STREAM_READ_NB(consumer_proc, input_stream, j, err);
needs to be tested to detect valid data in 'j'.
Also please note that, because of the #pragma CO PIPELINE in the hardware process Sum(), you will also experience "pipeline stalling": a pipeline requires EVERY stage to be ready in order to move forward. In the pipelined loop in Sum(), this means that both co_stream_read() and co_stream_write(), each of which is a blocking call, must be ready. As a result, you will only get data out when you put data in; you basically have to "push" the data out with new data.
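
The stalling behavior described above can be illustrated with a small software model. This is only a toy sketch: pipeline_t, pipeline_push(), count_outputs() and the two-stage depth are hypothetical stand-ins chosen for illustration, not code generated by CoDeveloper, and the real pipeline depth of Sum() is not known here.

```c
#include <assert.h>
#include <stddef.h>

#define STAGES 2  /* assumed pipeline depth, chosen only for illustration */

/* Toy model of a pipelined loop: a value advances one stage per push. */
typedef struct {
    int value[STAGES];
    int valid[STAGES];
} pipeline_t;

/* Push one input. Returns 1 and stores the oldest value in *out if one
 * left the final stage, otherwise returns 0. Nothing moves without a push. */
static int pipeline_push(pipeline_t *p, int in, int *out)
{
    assert(p != NULL && out != NULL);
    int produced = p->valid[STAGES - 1];
    if (produced)
        *out = p->value[STAGES - 1];
    for (int i = STAGES - 1; i > 0; --i) {
        p->value[i] = p->value[i - 1];
        p->valid[i] = p->valid[i - 1];
    }
    p->value[0] = in;
    p->valid[0] = 1;
    return produced;
}

/* Push the first n inputs and count how many outputs emerge. */
static int count_outputs(const int *in, int n)
{
    pipeline_t p = {{0}, {0}};
    int out = 0, seen = 0;
    for (int i = 0; i < n; ++i)
        seen += pipeline_push(&p, in[i], &out);
    return seen;
}
```

In this model, pushing three inputs yields only one output; the other two values stay inside until two more inputs are pushed, which matches the "push the data out with new data" description.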

QUOTE
- To complicate the situation a little more, I decided to use floating point in my example. Apparently, after a few (four?) reads executed by the Consumer process, the data in the APU is not well rounded. Is this a problem with the binary floating-point arithmetic of the FPU used? Which FPU is used for the floating-point examples?

This is actually due to the precision of the type 'float' and the fact that it still uses binary to represent real numbers. You will see the same results with any floating-point library; in fact, the software simulation of your project output the same results. Use an initial value of 1.125, which is more binary friendly, and you will see the difference.
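
The difference between a "binary friendly" value and an ordinary decimal fraction can be checked on any host. The sketch below is independent of the APU and CoDeveloper; exact_in_float() and float_error() are hypothetical helper names, and 1.1 is used only as an example of a value that is not exactly representable (the value used in the original project is not shown in this thread).

```c
#include <assert.h>

/* Returns 1 if x survives a round trip through 'float' unchanged. */
static int exact_in_float(double x)
{
    volatile float f = (float)x;  /* volatile: force a real 32-bit store */
    return (double)f == x;
}

/* Error introduced by storing x in a float (zero if x is exact). */
static double float_error(double x)
{
    volatile float f = (float)x;
    return (double)f - x;
}
```

1.125 is 1 + 1/8, so it fits exactly in the float mantissa and float_error(1.125) is zero, whereas a value such as 1.1 picks up a small error that accumulates over repeated additions.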

Thanks,
Ed
Ed Trexel
Impulse Accelerated Technologies, Inc.

#9 kerp

    Member

  • Members
  • PipPip
  • 4 posts

Posted 06 June 2007 - 11:58 AM

Hi Ed,

QUOTE (etrexel @ Jun 6 2007, 10:16 AM) <{POST_SNAPBACK}>
What version of CoDeveloper are you running?
The original cause of this was the APU interface mistakenly intercepting FPU accesses; we believe the Linux kernel was scanning to see whether an FPU is present. This was fixed in 2.20.f.x. However, the APU interface is still unable to fully handle instruction flushes (due to interrupts, context switching, etc.), which may lead to lost data during reads or to repeated data during writes.

I am using the latest version (2.20.h.3).


QUOTE (etrexel @ Jun 6 2007, 10:16 AM) <{POST_SNAPBACK}>
Please note that in Consumer(), the 'err' in the call:
HW_STREAM_READ_NB(consumer_proc, input_stream, j, err);
needs to be tested to detect valid data in 'j'.
Also please note that, because of the #pragma CO PIPELINE in the hardware process Sum(), you will also experience "pipeline stalling": a pipeline requires EVERY stage to be ready in order to move forward. In the pipelined loop in Sum(), this means that both co_stream_read() and co_stream_write(), each of which is a blocking call, must be ready. As a result, you will only get data out when you put data in; you basically have to "push" the data out with new data.

Sure, I was aware of this; I have actually been using "if (!err)" before using any data coming back from the APU. My doubt arose when I realized that the last data I write to the APU was consistently missing from the read operations. The combination of the first issue and pipeline stalling could explain this. I will do some tests and get back to you with the results.


QUOTE (etrexel @ Jun 6 2007, 10:16 AM) <{POST_SNAPBACK}>
This is actually due to the precision of the type 'float' and the fact that it still uses binary to represent real numbers. You will see the same results with any floating-point library; in fact, the software simulation of your project output the same results. Use an initial value of 1.125, which is more binary friendly, and you will see the difference.

I made a quick test on the PowerPC, and I am getting different (correct) results. Please find attached the code/logs. In this case too, I will get back to you when I have some test results.


Thanks for helping me in dealing with these problems,

Marcos
Fudepan

Attached File(s)



#10 etrexel

    Advanced Member

  • Impulse Staff
  • PipPipPip
  • 260 posts

Posted 06 June 2007 - 09:31 PM

Hi Marcos,
Version 2.20.h.3 is the best place to be. The missing very first output could potentially be avoided by pre-fetching the value being written to the stream to get the first cache miss out of the way (just a theory as the APU load/store transfers are always memory<->APU and not register<->APU). Let me know how your tests go.

QUOTE (kerp @ Jun 6 2007, 01:58 PM) <{POST_SNAPBACK}>
I made a quick test on the PowerPC, and I am getting different (correct) results. Please find attached the code/logs. In this case too, I will get back to you when I have some test results.

'cout' appears to be rounding the output. In Number::operator+=(), try either using printf() or set the output precision via:
cout << setprecision(12); // must also have "#include <iomanip>" at the top of file
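
The same effect can be seen with printf-style formatting in C. This is a minimal sketch, assuming the default stream output behaves like "%g" with 6 significant digits; format_sig() is a hypothetical helper, and 1.1f is just an example value.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format v with the given number of significant digits ("%.*g"). */
static void format_sig(char *buf, size_t len, float v, int digits)
{
    assert(buf != NULL && len > 0);
    snprintf(buf, len, "%.*g", digits, (double)v);
}
```

With 6 significant digits the float 1.1f prints as 1.1, which hides the representation error; with 12 digits it prints as 1.10000002384, which is the value the hardware and the software simulation are actually working with.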

Thanks,
Ed
Ed Trexel
Impulse Accelerated Technologies, Inc.

#11 yoshio_kashiwagi

    Newbie

  • Members
  • Pip
  • 2 posts
  • Location:Japan

Posted 06 June 2007 - 11:22 PM

Hi,

In the Linux environment, the standard binutils does not support the APU instructions.
Xilinx provided me with the source code of the EDK binutils, and I am extending the binutils in my Linux environment with those instructions.

In addition, the following setup is required in the kernel's setup_arch():

unsigned int msr;

msr = mfmsr();
mtmsr(msr | (XREG_APU_AVAILABLE | XREG_APU_ENABLE));

Furthermore, to allow APU instructions in user space as well, the definition of MSR_USER in include/asm-powerpc/reg.h must be changed as follows.

#ifdef CONFIG_XILINX_VIRTEX_APU
#define MSR_USER (MSR_KERNEL|MSR_PR|MSR_EE|MSR_VEC)
#else
#define MSR_USER (MSR_KERNEL|MSR_PR|MSR_EE)
#endif
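
The effect of the MSR_USER change can be sketched on a host machine as plain bit arithmetic. The bit values below are assumptions taken from the Linux PowerPC headers as I understand them, and the idea that the PowerPC 405 uses the MSR_VEC bit position for "APU available" is also an assumption; please check both against your own kernel tree and the processor manual. make_user_msr() and apu_allowed() are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit values; verify against include/asm-powerpc/reg.h. */
#define MSR_VEC 0x02000000u  /* AltiVec available; assumed to double as APU available on the 405 */
#define MSR_EE  0x00008000u  /* external interrupt enable */
#define MSR_PR  0x00004000u  /* problem state (user mode) */

/* Build the user-mode MSR from a kernel-mode MSR value,
 * mirroring the two MSR_USER definitions above. */
static uint32_t make_user_msr(uint32_t msr_kernel, int apu_configured)
{
    uint32_t msr = msr_kernel | MSR_PR | MSR_EE;
    if (apu_configured)
        msr |= MSR_VEC;
    return msr;
}

/* Returns 1 if the given MSR value permits APU instructions (under the assumption above). */
static int apu_allowed(uint32_t msr)
{
    return (msr & MSR_VEC) != 0;
}
```

Without the extra bit, every user process starts with the APU bit cleared, so an APU instruction in user space would trap even though the kernel itself enabled the APU in setup_arch().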

Best Regards,

Yoshio Kashiwagi

#12 etrexel

    Advanced Member

  • Impulse Staff
  • PipPipPip
  • 260 posts

Posted 07 June 2007 - 08:36 AM

Thank you, Yoshio, for your post. Have you been using the APU interface under Linux? If so, other than the MSR settings, have you encountered anything else?

QUOTE (yoshio_kashiwagi @ Jun 7 2007, 01:22 AM) <{POST_SNAPBACK}>
In the Linux environment, the standard binutils does not support the APU instructions.
Xilinx provided me with the source code of the EDK binutils, and I am extending the binutils in my Linux environment with those instructions.


To which binutils tools are you referring? I'm not sure if you caught this earlier in this thread, but non-Xilinx-supplied GNU tools for the PowerPC know the APU instructions 'lwfcmx' and 'stwfcmx' by their original names, 'lvewx' and 'stvewx' respectively. The exported driver code for the Virtex-4 is aimed at the GNU compiler supplied with XPS, but the couple of changes below will work with non-XPS compilers (assuming the option '-maltivec' is turned on for the assembler):

Edit the exported co.h and add the following code at the top, just under the line #include "apu_if.h":
CODE

#ifdef LINUX
#include
#include
#define print(...) printf(__VA_ARGS__)
/* assembly mnemonics for apu instructions */
/* lwfcmx=lvewx, stwfcmx=stvewx for GNU compiler vs. Xilinx EDK compiler */
#define lwfcmx(rn, base, adr) __asm__ __volatile__(\
"lvewx " #rn ",%0,%1\n"\
: : "b" (base), "r" (adr)\
)

#define stwfcmx(rn, base, adr) __asm__ __volatile__(\
"stvewx " #rn ",%0,%1\n"\
: : "b" (base), "r" (adr)\
)
#endif


and then in apu_if.h:
1) Wherever you see "lwfcmx" replace it with lvewx
2) Wherever you see "stwfcmx" replace it with stvewx
(there may be a better way than this)
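
One possible alternative to editing apu_if.h by hand is to select the mnemonic with the preprocessor and build the assembler template from it. The sketch below only constructs the template strings, so it can be compiled and checked on any host; it does not emit any PowerPC instruction. USE_XILINX_MNEMONICS, APU_LOAD, APU_STORE, APU_ASM_TEXT and the two helper functions are hypothetical names, not part of EDK or CoDeveloper.

```c
#include <assert.h>
#include <string.h>

/* USE_XILINX_MNEMONICS is a hypothetical switch, not an EDK or CoDeveloper flag. */
#ifdef USE_XILINX_MNEMONICS
#define APU_LOAD  "lwfcmx"
#define APU_STORE "stwfcmx"
#else
#define APU_LOAD  "lvewx"
#define APU_STORE "stvewx"
#endif

/* Build the assembler template text for register number rn.
 * Adjacent string literals are concatenated by the compiler. */
#define APU_ASM_TEXT(mnemonic, rn) mnemonic " " #rn ",%0,%1\n"

/* Template strings that would be handed to __asm__ __volatile__. */
static const char *apu_load_text(void)  { return APU_ASM_TEXT(APU_LOAD, 0); }
static const char *apu_store_text(void) { return APU_ASM_TEXT(APU_STORE, 0); }
```

With a construction like this, the same source could be built with either toolchain by defining or omitting the switch, instead of replacing the mnemonics throughout the header.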

Thanks,
Ed
Ed Trexel
Impulse Accelerated Technologies, Inc.

#13 kerp

    Member

  • Members
  • PipPip
  • 4 posts

Posted 11 June 2007 - 08:50 PM

Hi Ed,

QUOTE (etrexel @ Jun 6 2007, 09:31 PM) <{POST_SNAPBACK}>
Version 2.20.h.3 is the best place to be. The missing very first output could potentially be avoided by pre-fetching the value being written to the stream to get the first cache miss out of the way (just a theory as the APU load/store transfers are always memory<->APU and not register<->APU). Let me know how your tests go.


Sorry for my late response. Yesterday I spent some time doing tests on this.

At first, I was under the impression that this behavior could be caused by an FCM instruction being flushed from the CPU pipeline, and that the FCM could re-issue this flushed instruction, corrupting its internal data. To test this hypothesis, I changed the StoreWBOK bit in the APU controller configuration register, forcing the APU controller to generate the APUFCMWRITEBACKOK signal for all FCM-related instructions. This signals to the CPU that a point of no return has been reached, after which the instruction cannot be flushed. Unfortunately, the bug was still there after reconfiguring the APU controller configuration register.

Then I moved on to the idea that this behavior could be a cache coherence problem. Unfortunately, most of the data cache invalidate and flush instructions that guarantee data coherence run in privileged (kernel) mode and cannot be used from user space, so there was nothing more I could do along this path.

Finally, I went back to the code and made minor tweaks to it (enclosing all HW_STREAM_READ_NB macro calls within the loop scope), and voila: when all else failed, this simple solution seems to be the correct one! I still don't know exactly why this works flawlessly, but now every value written to the APU can be read back by the CPU and nothing is missing. Please find attached the code and logs.
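
The shape of such a loop can be sketched in plain C. This is not the attached code and it does not explain why the change fixed the hardware behavior: fifo_t, fifo_write_nb(), fifo_read_nb() and loopback() are hypothetical stand-ins, with a software FIFO of depth 8 taking the place of the APU streams and the return values playing the role of 'err'.

```c
#include <assert.h>
#include <stddef.h>

#define DEPTH 8  /* stream depth mentioned earlier in the thread */

typedef struct { int buf[DEPTH]; int head; int count; } fifo_t;

/* Both calls return 0 on success and nonzero if they could not complete. */
static int fifo_write_nb(fifo_t *f, int v)
{
    if (f->count == DEPTH) return 1;          /* full */
    f->buf[(f->head + f->count) % DEPTH] = v;
    f->count++;
    return 0;
}

static int fifo_read_nb(fifo_t *f, int *v)
{
    if (f->count == 0) return 1;              /* empty */
    *v = f->buf[f->head];
    f->head = (f->head + 1) % DEPTH;
    f->count--;
    return 0;
}

/* Write n values, attempting a non-blocking read in every iteration. */
static int loopback(const int *in, int n, int *out)
{
    assert(in != NULL && out != NULL);
    fifo_t f = {{0}, 0, 0};
    int got = 0, v = 0, i = 0;
    while (i < n) {
        if (fifo_write_nb(&f, in[i]) == 0)
            ++i;
        if (fifo_read_nb(&f, &v) == 0)        /* read inside the loop; v is valid only on success */
            out[got++] = v;
    }
    while (fifo_read_nb(&f, &v) == 0)         /* drain whatever is left */
        out[got++] = v;
    return got;
}
```

The point of the structure is that every write is paired with a read attempt whose result is only used when the call reports success, followed by a final drain.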

QUOTE (etrexel @ Jun 6 2007, 09:31 PM) <{POST_SNAPBACK}>
'cout' appears to be rounding the output. In Number::operator+=(), try either using printf() or set the output precision via:
cout << setprecision(12); // must also have "#include <iomanip>" at the top of file

You are right about this. cout was rounding the output; using setprecision(12), the behaviour of the code is the same as the one seen in the SW/HW design.

Thanks a lot!

Marcos

Attached File(s)






