Full Version: why imageDMA is so slow to do DMA?
mypowlar
Hi, I don't know why imageDMA is so slow at doing DMA. It is much slower than memcpy().
In the code below there is no operation other than co_memory_readblock/co_memory_writeblock, yet it takes
a long time, and I don't know how to resolve the problem.
Can you help me? Thank you!


do {
    co_signal_wait(go, &status);

    co_memory_readblock(imgmem, 0, img, DATA_NUM*2);  // read from imgmem into img

    co_memory_writeblock(imgmem, 0, img, DATA_NUM*2); // write from img back to imgmem

    co_signal_post(done, 1);
} while (1);
etrexel
QUOTE (mypowlar @ Apr 24 2007, 09:50 PM) *
hi, I don't know why the imageDMA is so slow to do DMA? It is very slower than memcpy().
As followed code, no operation except co_memory_readblock/co_memory_writeblock use
a long time! I don't know how to resolve the problem!
Can you help me! Thank you!
do{
co_signal_wait(go, &status);

co_memory_readblock(imgmem, 0, img, DATA_NUM*2); //read from imgmen to img

co_memory_writeblock(imgmem, 0, img, DATA_NUM*2);//from img to imgmem

co_signal_post(done, 1);
}while(1);


Hi,
What are the times you are seeing and how are you measuring time of memcpy vs. the DMA? Could you please post or email to support@impulsec.com the complete Impulse C project as well as any code related to the memcpy timing?

Thanks,
Ed
mypowlar
Hi,
I have emailed the complete Impulse C project, as well as the code related to
memcpy, to support@impulsec.com.
etrexel
QUOTE (mypowlar @ Apr 25 2007, 01:00 AM) *
Hi,
I have email to support@impulsec.com the complete Impulse C project as well as code related to
memcpy.


Hi,
Thank you for sending in the code. I have not run the code on my ML403 (yet) but my initial thoughts are:

1) The data cache of the CPU is turned on. This will make the memcpy run much faster because the writes are being cached, and the timing will not include the time for the data to be written to SDRAM. Changing the line in main():
XCache_EnableDCache(0x80000001);
to:
XCache_EnableDCache(0x00000001);

will turn off data caching for the lower memory where the SDRAM is so the operations that the CPU and DMA are doing will be much more similar.

2) The memcpy which takes a number of bytes to copy for the third parameter, doesn't appear to be copying the same amount of data. DATA_NUM is 256 and the target array in the hardware process is of type co_int16, so the memcpy() call should be changed from:
memcpy((int*)XPAR_SDRAM_8MX32_BASEADDR,(int*)0x0,100);
to:
memcpy((int*)XPAR_SDRAM_8MX32_BASEADDR,(int*)0x0,(DATA_NUM*sizeof(co_int16)));

in order to copy the same number of bytes.

3) Lastly, please note that the co_signal_wait() call by the CPU is polling the co_signal interface to the hardware process in a fairly tight loop. The CPU also has priority on the data bus, which may slow down the DMA's access through contention for the bus. Isolating the CPU's path to the co_signal interface from the DMA's path to memory will improve the throughput of the DMA. This can be done by having the SDRAM on the OPB while the CPU uses the PLB (Virtex-4 PLB PSP) or APU (Virtex-4 APU PSP) to access the co_signal, because the DMA currently can only use the OPB.
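
A minimal sketch of point 2 in plain C, so it can be checked without the board: it sizes the memcpy() from the element count and element type, the same way the DMA call uses DATA_NUM*2. The names elem16_t, dma_transfer_bytes() and copy_like_dma() are made up for illustration, and elem16_t only stands in for co_int16 on the assumption that co_int16 is 16 bits wide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DATA_NUM 256            /* element count quoted in point 2 */

typedef int16_t elem16_t;       /* hypothetical stand-in for co_int16 (assumed 16 bits) */

/* Number of bytes moved for DATA_NUM 16-bit elements (matches DATA_NUM*2). */
static size_t dma_transfer_bytes(void)
{
    return DATA_NUM * sizeof(elem16_t);
}

/* Copy exactly as many bytes as the DMA transfer moves; returns that count. */
static size_t copy_like_dma(elem16_t *dst, const elem16_t *src)
{
    size_t nbytes = dma_transfer_bytes();

    assert(dst != NULL && src != NULL);
    memcpy(dst, src, nbytes);
    return nbytes;
}
```

With these numbers the copy is 512 bytes, a little over five times the 100 bytes in the original memcpy() call, so the two timings were not measuring the same amount of work.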

Hope this helps,
Ed
mypowlar
Hi,
Thank you for your reply.

1) The data cache of the CPU is turned on. I think it is useful in any application, so turning the cache off is not an option.
2) memcpy takes the number of bytes to copy as its third parameter. I have set the sizes to be the same,
but it made no difference.


Can you suggest a way for an Impulse C process to work on the data in BRAM directly, without passing the data through
co_memory_readblock/co_memory_writeblock or stream_read/stream_write?
etrexel
Hi,
QUOTE (mypowlar @ Apr 25 2007, 08:25 PM) *
1) The data cache of the CPU is turned on,I think it is useful in any application., so to turned down cache is impossible.

I was only mentioning this for comparison reasons; I wouldn't suggest doing it for the final application, as that would of course defeat the purpose of the data cache :)

QUOTE (mypowlar @ Apr 25 2007, 08:25 PM) *
2) The memcpy which takes a number of bytes to copy for the third parameter. I have set them to same.
but no useful.

What are the times you are seeing for the DMA and memcpy()? How is the system configured or could you please send your EDK project's .mhs file? There is also the possibility that the DMA is suffering from element size (co_int16) vs. bus width (32/64 bits depending upon bus), but I also suspect (from experience) that the CPU is dominating the bus while polling the signal and keeping the DMA off of it.
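
To put rough numbers on the element size vs. bus width point, here is a purely arithmetic sketch in plain C. It does not describe what the Impulse C DMA actually does on the bus (that is not stated in this thread); it only compares the two extremes: one element per bus beat versus elements packed to fill the bus. The function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Bus beats needed if every element occupies its own beat. */
static size_t beats_one_element_per_beat(size_t n_elems)
{
    return n_elems;
}

/* Bus beats needed if elements are packed to fill the full bus width. */
static size_t beats_packed(size_t n_elems, size_t elem_bytes, size_t bus_bytes)
{
    size_t total_bytes;

    assert(elem_bytes > 0 && bus_bytes >= elem_bytes);
    total_bytes = n_elems * elem_bytes;
    return (total_bytes + bus_bytes - 1) / bus_bytes;   /* round up */
}
```

For 256 elements of 2 bytes each, that is 256 beats unpacked against 128 beats packed on a 32-bit bus or 64 beats on a 64-bit bus, so unpacked 16-bit transfers could cost up to 2x or 4x the bus time before any contention with the CPU is counted.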

QUOTE (mypowlar @ Apr 25 2007, 08:25 PM) *
Can you give a suggestion wich impulseC process the data in BRAM directly not to pass data by
co_memory_readblock/co_memory_writeblock or stream_read/stream_write?

There are a couple of ways; all require writing and/or modifying VHDL code. The most correct way would be to tie a co_memory interface to one port of a BRAM and tie the other port to a bus via a BRAM controller in EDK. This requires a co_memory-to-BRAM wrapper (the "SharedBRAM" example's 'shared_mem.vhd' tries to show this, but might be out of date in some respects), and the connections would need to be made from within EDK because EDK will have created the BRAM. A co_memory can be accessed directly from within a hardware process using a pointer, without requiring the use of co_memory_read/writeblock().

Thanks,
Ed
mypowlar
Hi,
Can Impulse C add this function? I am not familiar with VHDL or Verilog.

thanks
etrexel
QUOTE (mypowlar @ Apr 25 2007, 11:07 PM) *
Hi,
Can impulseC add this function? I am not familiy with vhdl or verilog.
thanks

Hi,
This would require a fairly specific PSP to do everything automatically, however, there isn't a great demand for it at the moment (it would be nice to have though). One of the many projects I am working on may benefit from a similar shared memory arrangement, but it would require extra steps in EDK and I do not know when I would have something I could give you to use - the earliest would be next week, but I cannot make any promises at the moment. Please note that in a shared memory arrangement such as this, just like in a system with a DMA, the CPU's cache will need to be managed via flushing (to force writes) and invalidating (to force reads) in order to pass data correctly between the CPU and an external master that is reading/writing the same memory.
Typically streams are used to populate arrays and with the HW_STREAM_READ/WRITE() macros, data transfers are more like a memcpy() where the destination address doesn't change. This can be done from within Impulse C as part of the main process or as a separate process(es) that read/write global arrays.
In the meantime, it may be worthwhile to look at using the APU interface to separate the CPU's polling of the co_signal from the bus that the DMA is using.

Ed
mypowlar
Can you give a demo in which an Impulse C process works on the data in BRAM directly, with no need to pass the data?
etrexel
QUOTE (mypowlar @ May 17 2007, 01:02 AM) *
Can you give a demo wich impulseC process the data in BRAM directly with no need to pass data?


I haven't gotten as far as I'd hoped, but do have a little more time now to see what I can put together quickly. I still can't promise a timeframe just yet, but I will do what I can. What would be the minimum that would be useful for what you are doing?
When I last worked on it, I was close to having a co_memory interface that uses only co_memory_ptr() and a pointer (2-cycle access of single words, no co_memory_read/writeblock() support) to directly access a 32-bit wide BRAM that is shared with the PowerPC over the PLB (slave mode). The intent was to use co_signal's to communicate between the PowerPC and the hardware process that data is available for processing and that data has been processed. This arrangement currently takes a few extra steps in EDK as well as a quick edit of the generated files to expose the co_memory interface, and is also limited (which is to be corrected) to a single process accessing the co_memory.

Let me know if this might be useful to you,
Thanks,
Ed
mypowlar
Yes, I think it is useful for me. What I want to do is the following:

1. The CPU (PowerPC) copies 4k (or so) of data from SDRAM to BRAM, then sends a signal to the FPGA IP core, which is implemented with Impulse C.
2. The FPGA IP core runs the algorithm and sends a signal to the PowerPC when it completes its task.
3. The PowerPC uses the data in BRAM that has been processed by the FPGA IP.
etrexel
QUOTE (mypowlar @ May 18 2007, 04:37 AM) *
yes,I think it is useful for me. What I want to do is like followed:

1.cpu(powerpc) copy 4k(or so) data from sdram to BRAM, then send a signal to fpga IP core which implemented by impulseC.
2.the fpga IP core will do the algorithm and send a signal to powerpc when it complete it's task.
3.powerpc use the data in BRAM which have processed by fpag IP.

Hi,
I was able to put something together into a quick PSP to help avoid having to manually edit any files within EDK. Attached are the PSP (not part of the formal release; however, it also won't get overwritten), instructions for installing it, and a base Impulse C project to run through (it does what you're after using signals). Basically, the PSP is the same as the "Xilinx Virtex-4 PLB (VHDL)" PSP except that for a co_memory it will expose a XIL_BRAM-type interface that can be directly connected to one side of a 'block_bram' in EDK, and then the other end can be connected to just about anything. Access to the BRAM via the co_memory interface is limited to just a pointer (no co_memory_read/writeblock support) from a single process; please see the beginning of the "HowTo" doc for more info.
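
The three-step flow requested above (copy in, signal, process, signal, read back) can be sketched in plain C as a single-threaded model. This is not the attached project and uses no Impulse C calls: the two int flags merely stand in for the "start" and "done" co_signal's, the array stands in for the shared BRAM, and every name (shared_block_t, cpu_submit, hw_service, cpu_collect) is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_WORDS 1024            /* 4 KB of 32-bit words, per the "4k (or so)" request */

typedef struct {
    uint32_t bram[BLOCK_WORDS];     /* stand-in for the shared BRAM       */
    int      start;                 /* stand-in for the "start" co_signal */
    int      done;                  /* stand-in for the "done" co_signal  */
} shared_block_t;

/* Step 1: the CPU copies a block into the BRAM and raises "start". */
static void cpu_submit(shared_block_t *s, const uint32_t *src)
{
    assert(s != NULL && src != NULL);
    memcpy(s->bram, src, sizeof s->bram);
    s->done = 0;
    s->start = 1;
}

/* Step 2: the hardware process works in place, then raises "done".
 * Returns 1 if a block was processed, 0 if "start" was not raised. */
static int hw_service(shared_block_t *s)
{
    int i;

    assert(s != NULL);
    if (!s->start)
        return 0;
    s->start = 0;
    for (i = 0; i < BLOCK_WORDS; i++)
        s->bram[i] = s->bram[i] + 1u;   /* placeholder for the real algorithm */
    s->done = 1;
    return 1;
}

/* Step 3: the CPU reads the result back once "done" is raised.
 * Returns 1 on success, 0 if the result is not ready yet. */
static int cpu_collect(shared_block_t *s, uint32_t *dst)
{
    assert(s != NULL && dst != NULL);
    if (!s->done)
        return 0;
    memcpy(dst, s->bram, sizeof s->bram);
    s->done = 0;
    return 1;
}
```

In the real system the two sides run concurrently and the hardware process touches the BRAM through the pointer from co_memory_ptr(); the model only shows the ordering that the signals enforce.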

Hope this helps get you going,
Thanks,
Ed
mypowlar
Thank you very much for your help. I think it is very useful for my work, and
I am still running some tests on your project.
Now I have the following problem:
In your "HowTo.doc":
6) Configure the ‘plb_bram_comem_block’:
a. Check the “Support PLB Burst and Cache Line Transfers” box:
b. Change c_baseaddr to 0xA0000000
c. Change c_highaddr to 0xA0003fff (this also determines how much BRAM is created)

Now I have changed c_baseaddr to 0xffff0000 and c_highaddr to 0xffff3fff.
Given step a. (check the “Support PLB Burst and Cache Line Transfers” box),
can I use XCache_EnableICache(0x00000001); XCache_EnableDCache(0x00000001);
to make the plb_bram_comem_block cacheable?
I found that I can't, but I don't know why it can't be made cacheable.
mypowlar
In the "HowTo.doc",
- co_memory access is limited to:
o 32-bit words ONLY
o Only the use of the pointer returned from co_memory_ptr() may be used, co_memory_readblock() and co_memory_writeblock() are NOT supported
o Currently only ONE hardware process may access the co_memory

Can you modify it to support 16-bit accesses? I need 16-bit support.
Ideally it would support 8-bit, 16-bit and 32-bit.
etrexel
QUOTE (mypowlar @ May 21 2007, 01:26 AM) *
now, i change c_baseaddr to 0xffff0000 and Change c_highaddr to 0xffff3fff.
Accord to a. Check the "Support PLB Burst aand Cache Line Transfers" box:,
can i use XCache_EnableICache(0x00000001); XCache_EnableDCache(0x00000001);
to enable plb_bram_comem_block cacheable?
I found it can't,but i don't know why it can't cacheable.


The shared memory must be uncacheable, or it would require a lot of cache management to make sure the data written by the CPU is copied into the shared memory before the CPU sends the "start" signal to the hw process, as well as to make sure the CPU isn't reading "stale" data from the cache. The CPU's cache controller is only aware of changes to memory done by the CPU, because the CPU goes through the cache controller when it reads/writes data to memory.
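
A toy model in plain C of why that management is needed. It models a single memory word with a single write-back cache entry, which is a deliberate simplification and not the PowerPC's actual cache; all names are hypothetical. The hardware side only ever sees the memory field, while the CPU sees its cached copy for as long as that copy is valid.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model: one memory word plus one write-back cache entry for it. */
typedef struct {
    int32_t memory;     /* what the hardware process sees (BRAM contents) */
    int32_t cached;     /* what the CPU sees while the entry is valid     */
    int     valid;      /* the cache holds a copy                         */
    int     dirty;      /* the cached copy is newer than memory           */
} toy_cache_t;

static void cpu_write(toy_cache_t *c, int32_t v)
{
    assert(c != NULL);
    c->cached = v;      /* the write lands in the cache only */
    c->valid = 1;
    c->dirty = 1;
}

static int32_t cpu_read(toy_cache_t *c)
{
    assert(c != NULL);
    if (!c->valid) {    /* miss: fetch from memory */
        c->cached = c->memory;
        c->valid = 1;
        c->dirty = 0;
    }
    return c->cached;   /* hit: memory is not consulted */
}

/* Flush: push a dirty entry out to memory (forces the write). */
static void cache_flush(toy_cache_t *c)
{
    assert(c != NULL);
    if (c->valid && c->dirty) {
        c->memory = c->cached;
        c->dirty = 0;
    }
}

/* Invalidate: drop the entry so the next read goes to memory. */
static void cache_invalidate(toy_cache_t *c)
{
    assert(c != NULL);
    c->valid = 0;
    c->dirty = 0;
}

static int32_t hw_read(const toy_cache_t *c)  { assert(c != NULL); return c->memory; }
static void hw_write(toy_cache_t *c, int32_t v) { assert(c != NULL); c->memory = v; }
```

In this model the hardware reads the old value until the CPU flushes, and the CPU reads its own stale copy after a hardware write until it invalidates, which are exactly the two failure cases described above.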

Ed
etrexel
QUOTE (mypowlar @ May 21 2007, 02:10 AM) *
In the "HowTo.doc",
- co_memory access is limited to:
o 32-bit words ONLY
o Only the use of the pointer returned from co_memory_ptr() may be used, co_memory_readblock() and co_memory_writeblock() are NOT supported
o Currently only ONE hardware process may access the co_memory

Can you modify to support 16-bit ? I need it to support 16-bit.
it is best to support 8-bit,16-bit and 32-bit.


This was a "quick" adaptation of something from a project that happened to use 32-bits (convenient bus width) and is very thin. Ideally (and as I had time to refine it) it would do dynamic bus sizing of widths from a single byte to the maximum bus width, but that adds some level of complexity. When I get a chance, I'll look into making a version that supports 16-bit (may have to be a fixed size).

Ed
mypowlar
When will you get a chance to make a version that supports 16-bit (may have to be a fixed size)?
I look forward to receiving it from you soon.
etrexel
QUOTE (mypowlar @ May 21 2007, 06:25 PM) *
When will you get a chance to make a version that supports 16-bit (may have to be a fixed size)?
I look forward to receiving it from you soon.


Here you go, this PSP will do 8, 16, and 32-bit accesses. To do 16-bit, just change the pointer type in the previous example to:
co_int16 *memblkPtr;

All previous notes still apply as does the "How To" doc.

Ed
mypowlar
co_int16 *memblkPtr;
co_int16* p1;
co_int16* p2;

memblkPtr = co_memory_ptr(memblk);
p1 = memblkPtr; //is ok

p2 = memblkPtr+256;// in menu project->Generate HDL
SharedMem_hw.c:60: Unexpected pointer assignment
iMake: *** [SharedBRAM.xic] Error 1

p2 = &memblkPtr[256];// in menu project->Generate HDL
Expecting a memory object: memblkPtr
iMake: *** [SharedBRAM.xhw] Error 1

Can you tell me how to assign &memblkPtr[256] to p2, just as in an ordinary C compiler?

thanks
etrexel
QUOTE (mypowlar @ May 22 2007, 02:16 AM) *
co_int16 *memblkPtr;
co_int16* p1;
co_int16* p2;

memblkPtr = co_memory_ptr(memblk);
p1 = memblkPtr; //is ok

p2 = memblkPtr+256;// in menu project->Generate HDL
SharedMem_hw.c:60: Unexpected pointer assignment
iMake: *** [SharedBRAM.xic] Error 1

p2 = &memblkPtr[256];// in menu project->Generate HDL
Expecting a memory object: memblkPtr
iMake: *** [SharedBRAM.xhw] Error 1

can you tell me how to assignment p2 to &memblkPtr[256] just like in common c compiler?

thanks


The co_memory interface, specifically the use of co_memory_ptr() and a pointer, is still fairly new and "special" due to the external interactions necessary. Though it is likely to be supported in the future more like pointers are for local arrays (see "Pointer Support in Hardware Processes" in the CoDeveloper User Guide), currently only offsets from the base pointer returned by co_memory_ptr() are allowed. What you're after can be done with offsets/indexes rather than pointers:

co_int16 *memblkPtr;
co_int16 p1_Idx;
co_int16 p2_Idx;
co_int16 tmp;

memblkPtr = co_memory_ptr(memblk);

p1_Idx= 0; // p1 = memblkptr
p2_Idx= 256; // p2 = memblkptr + 256

memblkPtr[p1_Idx++] = 5; // *(p1++)=5
tmp = memblkPtr[p2_Idx]; // tmp = *p2
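
The same index-for-pointer substitution, written as a small plain C function that can be tried on a desktop compiler. An ordinary array stands in for the co_memory, elem16_t stands in for co_int16, and BRAM_WORDS and fill_and_sum() are made-up names; the point is only that every access goes through base[index], never through a derived pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BRAM_WORDS 512          /* hypothetical size of the shared block, in 16-bit elements */

typedef int16_t elem16_t;       /* hypothetical stand-in for co_int16 */

/* Writes 'count' values of 5 starting at index 0 and returns the sum of
 * 'count' values starting at index 'second_base', using only base[index]. */
static int32_t fill_and_sum(elem16_t *base, int count, int second_base)
{
    int p1_idx = 0;             /* plays the role of p1 = base               */
    int p2_idx = second_base;   /* plays the role of p2 = base + second_base */
    int32_t sum = 0;
    int i;

    assert(base != NULL);
    assert(count >= 0 && second_base >= 0 && second_base + count <= BRAM_WORDS);

    for (i = 0; i < count; i++) {
        base[p1_idx++] = 5;     /* *(p1++) = 5    */
        sum += base[p2_idx++];  /* sum += *(p2++) */
    }
    return sum;
}
```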

Ed
mypowlar
I have tested the comem_ptr, but it is still slower than not using your IP and letting the BRAM be cached.
Can you change the IP so that the shared memory can be cacheable?
etrexel
QUOTE (mypowlar @ May 23 2007, 09:57 AM) *
I have tested the comem_ptr, but it still slower than which not use your IP but let bram cached .
Can you change the IP to let shared memory support cacheable?


Caching is not a function of the IP, but of the CPU. There may be speed gains to be had by manually managing the CPU's data cache (the instruction cache shouldn't matter since you're not executing from it); you will need to read up on PowerPC assembly instructions such as 'dcba', 'dcbf', 'dcbi', etc.

Lastly, is the BRAM being used for anything else such as your program code and/or stack or heap? That would make a big difference.

Ed
RalphBodenner
Another way to use the load/store function of co_memory is to assign each pointer the result of co_memory_ptr directly, then use it like an array reference:

CODE
co_int16* p1;
co_int16* p2;
co_int16 sum;

p1 = co_memory_ptr(memblk);
p2 = co_memory_ptr(memblk);

sum = p1[0] + p2[16] + 256;


See the PointerSort example for some code that uses co_memory_ptr.

Regards,
Ralph
mypowlar
Can you provide an EDK IP core that implements DMA from PLB SDRAM to PLB BRAM, like the DMA that many other CPUs offer?
etrexel
QUOTE (mypowlar @ May 29 2007, 09:26 AM) *
Can you give a edk IP core to implemet a DMA from PLB SDRAM to PLB BRAM just like many other CPUs offering?


EDK already comes with what you are looking for: an IP core called 'plb_central_dma'. It appears under "IP Catalog" in the "DMA" section as "PLB Central DMA" and also includes the necessary drivers for the PowerPC.

Ed
mypowlar
Now I have two BRAMs, BRAM_X and BRAM_Y.
I want to implement an IP that processes the data in BRAM_X and saves the result to BRAM_Y.
Can you give me code for it? Thanks!

(The overall data flow is: data is sent to PORT_A of BRAM_X,
the IP processes the data in BRAM_X through PORT_B of BRAM_X and sends the result
to PORT_A of BRAM_Y, and the PowerPC then gets the result through PORT_B
of BRAM_Y.)


In addition, can you give me asynchronous FIFO code to interface with PORT_A
of BRAM_X in EDK?
etrexel
QUOTE (mypowlar @ Jun 25 2007, 08:12 PM) *
now, I have two brams BRAM_X and BRAM_Y.
I want to implement a IP to process data in BRAM_X and save result to BRAM_Y.
Can you give a code for it? thanks!

The BRAMs in the previous EDK example are used in full dual-port mode, one side accessed from the CPU and the other from hardware processes. Unfortunately, the example and the PSP it was based on are currently limited to a single process (due to the co_memory_ptr interface) accessing a single co_memory/BRAM (due to the PSP itself), and it wouldn't be trivial to add another BRAM interface. If I have time, I will look into it to see what is necessary, but I cannot make any promise it will happen.

QUOTE
In addition,can you give me a asynchronous FIFO code to interface with PORT_A
of BRAM_X in edk?

This may be easier done using a stream. If the reads/writes to the BRAMs through the co_memory interface are already linear, then you may be able to use a stream by replacing the memory accesses with co_stream_read/write()'s. If the data access is random, then maybe the PowerPC can do the accessing as a FIFO, assuming it is the one writing to PORT_A on BRAM_X. Otherwise, what is writing data to BRAM_X through PORT_A and how is it being read from it on PORT_B?

Ed
mypowlar
"I will look into it to see what is necessary, but I cannot make any promise it will happen."


How soon will you do it? I am eager to test it .
etrexel
QUOTE (mypowlar @ Jun 26 2007, 07:22 PM) *
"I will look into it to see what is necessary, but I cannot make any promise it will happen."
How soon will you do it? I am eager to test it .

Hi,
Luckily it wasn't as difficult as I thought it could have become, and I was able to add support for more co_memory<->BRAM interfaces with the small amount of time I had available to work on this. Currently the number of interfaces is set to 2 and should be able to be increased if necessary, see note at top of '_hw.c. Attached are the results: a new set of PSP files and an example project. Instructions are the same as before in EDK; however, do note that there are now multiple ports. The mapping of the memory name "xil_bram<n>" passed to co_memory_create() to "PORTA<n>" can be seen in comments in the generated MPD file.


Ed
mypowlar
First, thank you for your demo.

I think you have not understood me.
In the demo you gave me, the IP processes the data and
saves the result in the same BRAM, and the other BRAM is not
used in the way I need.

What I need is for the original data to be in bram_1,
with the IP reading the data from bram_1 and saving the processed data
to bram_2.
etrexel
QUOTE (mypowlar @ Jun 28 2007, 08:43 AM) *
first,thank you for your demo.

I think you have not understand me.
the demo you give me is that the IP processing data and
save data in the same bram, while another bram is not
used right for me.

what's i needed is that the orignal data is in bram_1,
and the IP read data from bram_1 and save processed data
to bram_2.

My main objective was to get you past the hard part of getting the PSP to generate multiple co_memory<->BRAM interfaces. To make the two co_memory's accessible from the same process, all you need to do is add the second one as a parameter. In the example, just remove hw_proc2() (and all related configuration), then add 'themem2' to the arguments passed to hw_proc() and change 'themem' to 'themem2' in the arguments passed to consumer(). Highlights of the necessary code changes:
CODE

void hw_proc(co_signal start, co_memory memblk, co_memory memblk2, co_signal done)
{
    // create a new pointer to access 'memblk2', just like the one used to access 'memblk'
}
...
void config_BRAM_CoMem(void *arg)
{
    ...
    pe_proc=co_process_create("hw_proc",(co_function)hw_proc,4,startsig,themem,themem2,donesig);
    ...
    consumer_proc=co_process_create("consumer_proc",(co_function)consumer,2,donesig,themem2);
    ...
}
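
For the body of hw_proc(), the read-from-one, write-to-the-other loop might look roughly like the following plain C sketch. It is not the example project's code: two ordinary arrays stand in for the pointers that co_memory_ptr() would return for 'memblk' and 'memblk2', word32_t stands in for a 32-bit Impulse C type, process_block() is a made-up name, and the arithmetic is only a placeholder for the real algorithm.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t word32_t;      /* hypothetical stand-in for a 32-bit co_memory element */

/* Reads nwords words from in_mem (first BRAM) and writes the processed
 * words to out_mem (second BRAM), using index accesses only. */
static void process_block(const word32_t *in_mem, word32_t *out_mem, int nwords)
{
    int i;

    assert(in_mem != NULL && out_mem != NULL && nwords >= 0);

    for (i = 0; i < nwords; i++) {
        out_mem[i] = in_mem[i] + 1u;    /* placeholder for the real algorithm */
    }
}
```

Because the input and output live in separate BRAMs, the original data is left untouched while the consumer reads results from the second memory.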


Ed