Hi
Currently, I have a hardware-accelerated application running as a single HW process on an FPGA.
The speed-up factor is around 60x (without instruction-level optimisation) compared to software running on MicroBlaze.
The speed-up factor could go up to around 100x (with instruction-level pipelining).
But a 100x speed-up is not enough; I need around 250x or greater.
I have also realised that the speed-up factor (100x) will not go up much further with the recommended instruction-level optimisation techniques.
So I am thinking of changing the current instruction-level pipelined loop into a system-level pipeline with multiple HW processes, passing dependent variables from one pipeline stage to the next.
But I am not sure what speed-up could be obtained from system-level parallelism. Could it dramatically improve the speed-up (say, from 100x to the required 250x or better)? Or, in general, what is the estimated speed-up ratio between instruction-level and system-level parallelism (worst case)?
Also, what is the recommended interface between pipeline stages (stream or register) if I would like to have, say, 100 or more large HW processes?
Thanks in advance.
With regards
Yan Lin Aung (Nanyang Technological University, Singapore)
Speed-up Ratio between Instruction Level and System Level Parallelism
Started by yan_lin_aung, Jan 17 2007 08:24 PM
2 replies to this topic
#1
Posted 17 January 2007 - 08:24 PM
#2
Posted 22 January 2007 - 04:20 PM
Hi,
As with everything, it will depend on your code and data flow. But generally:
- If your hardware process is already a single pipeline, and your incoming data isn't recursive and can be processed in parallel, then adding more processes in parallel multiplies the speed linearly by the total number of processes: two processes will be 2x the speed of one, and so on, at the cost of more hardware.
- If your hardware process isn't a single pipeline, but is made up of multiple pipelined loops, then moving each loop into its own process effectively creates a system-level pipeline. Instead of each loop waiting for the results of the last, they can all run in parallel, which yields a pipeline whose rate is set roughly by the time the slowest process takes from input to output. The entire chain of processes can again be replicated in parallel for a linear increase in speed, or it may be enough to replicate only the slowest process, raising its average throughput and with it the speed of the whole system.
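To put rough numbers on the two cases above, here is a small back-of-the-envelope model in Python. It is only a sketch under idealised assumptions: it ignores stream overhead, I/O bandwidth, clock-rate changes and FPGA resource limits, and the function names and stage times are hypothetical, not taken from your design.

```python
import math

def pipeline_rate(stage_times, replicas=None):
    """Steady-state throughput (items per unit time) of a system-level pipeline.

    stage_times: time each process needs per item (hypothetical units).
    replicas:    number of parallel copies of each stage (default: one each).
    """
    if replicas is None:
        replicas = [1] * len(stage_times)
    # A stage replicated r times handles, on average, one item every t / r.
    effective = [t / r for t, r in zip(stage_times, replicas)]
    # The slowest effective stage limits the whole pipeline.
    return 1.0 / max(effective)

def copies_needed(current_speedup, required_speedup):
    """Parallel copies needed, assuming speed-up scales linearly with copies."""
    return math.ceil(required_speedup / current_speedup)
```

For example, with made-up stage times of 2, 8 and 4 time units, running the three loops one after another costs 14 units per item, while `pipeline_rate([2, 8, 4])` gives one item every 8 units. Replicating only the slowest stage, `pipeline_rate([2, 8, 4], [1, 2, 1])`, gives one item every 4 units. Under the linear-scaling assumption, `copies_needed(100, 250)` says that going from 100x to 250x would take 3 parallel copies of an already pipelined process.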
Assuming your processes are already pipelined and accept serial data, streams are probably a better choice than registers. Streams are already serialized for a pipeline, the blocking read/write functions automatically synchronize your data in and out, and the depth of the stream's buffer can be changed to help tune the data flow through the system. It is also easy to create a multiplexer or demultiplexer process, one that reads from many streams and writes to one, or reads from one and writes to many, to handle multiple parallel processes.
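The demultiplexer/multiplexer arrangement can be illustrated with a software model. The sketch below is an assumption-laden stand-in, not the Impulse C stream API: Python threads play the role of hardware processes, bounded `queue.Queue` objects play the role of streams with a configurable buffer depth, and all names (`demux`, `worker`, `mux`, `run`, `EOS`) are hypothetical.

```python
import threading
from queue import Queue

EOS = object()  # end-of-stream marker, standing in for closing a stream

def demux(src, outs):
    """Read one stream and deal items round-robin to several streams."""
    i = 0
    while True:
        item = src.get()           # blocking read
        if item is EOS:
            break
        outs[i].put(item)          # blocking write when that buffer is full
        i = (i + 1) % len(outs)
    for out in outs:
        out.put(EOS)

def worker(inp, out, fn):
    """One replicated processing stage."""
    while True:
        item = inp.get()
        if item is EOS:
            break
        out.put(fn(item))
    out.put(EOS)

def mux(ins, dst):
    """Read several streams round-robin and merge them, preserving order."""
    i = 0
    while True:
        item = ins[i].get()
        if item is EOS:            # first EOS in round-robin order ends the data
            break
        dst.put(item)
        i = (i + 1) % len(ins)
    dst.put(EOS)

def run(data, fn, n_workers=3, depth=4):
    """Push data through demux -> n_workers copies of fn -> mux."""
    src, dst = Queue(depth), Queue()   # dst is unbounded so mux never stalls
    mid_in = [Queue(depth) for _ in range(n_workers)]
    mid_out = [Queue(depth) for _ in range(n_workers)]
    threads = [threading.Thread(target=demux, args=(src, mid_in)),
               threading.Thread(target=mux, args=(mid_out, dst))]
    threads += [threading.Thread(target=worker, args=(i, o, fn))
                for i, o in zip(mid_in, mid_out)]
    for t in threads:
        t.start()
    for x in data:
        src.put(x)
    src.put(EOS)
    for t in threads:
        t.join()
    results = []
    while True:
        item = dst.get()
        if item is EOS:
            break
        results.append(item)
    return results
```

Calling `run(range(10), lambda x: x * x)` pushes ten items through three parallel workers and returns the squares in their original order. The blocking `get`/`put` calls are what keep the processes synchronised without any explicit handshaking, and `depth` is the knob that corresponds to tuning the stream buffer depth.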
If you would like recommendations specific to your code, you are welcome to email it to support@impulsec.com and we will review it for you.
Thanks,
Ed
Ed Trexel
Impulse Accelerated Technologies, Inc.
#3
Posted 23 January 2007 - 09:11 AM
Hi
Thank you very much for the reply. It is very helpful.
I will try it out first.
With regards
Yan Lin Aung (NTU, Singapore)