head 1.3; access; symbols bg2_23:1.2 bg2_22:1.2 bg2_21:1.2 bg2_20:1.2 bg2_16:1.2 bg2_15:1.2 bg2_12:1.2 bg2_07:1.2 isorc2008_submission:1.2 handbook_alpha_edition:1.2; locks; strict; comment @% @; 1.3 date 2008.08.19.16.19.57; author martin; state Exp; branches; next 1.2; commitid 599148aaf29f4567; 1.2 date 2007.11.05.12.43.28; author martin; state Exp; branches; next 1.1; commitid 50ee472f0fbc4567; 1.1 date 2007.08.25.17.58.55; author martin; state Exp; branches; next ; commitid c3646d06dd24567; desc @@ 1.3 log @Corrections as suggested by Trevor @ text @
\section{HW/SW Codesign}
\label{sec:hwsw:co}

Using a hardware description language and loading the design into an
FPGA blurs the formerly strict border between hardware and software.
Is configuring an FPGA not more like loading a program for execution?
This looser distinction makes it possible to move functions easily
between hardware and software, resulting in a highly configurable
design. If speed is an issue, more functions are realized in hardware.
If cost is the primary concern, these functions are moved to software
and a smaller FPGA can be used. Let us examine these possibilities on
a relatively expensive function: \emph{multiplication}. Bytecode
\code{imul} performs a 32-bit signed multiplication with a 32-bit
result. There are no exceptions on overflow. Since 32-bit single-cycle
multiplications are far beyond the possibilities of current mainstream
FPGAs, the first solution is a sequential multiplier.
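The semantics of \code{imul} stated above can be checked in plain Java. The following is a minimal sketch, not code from JOP: the class name \code{ImulSemantics} and the helper \code{imulReference} are hypothetical names chosen for illustration. It relies on the Java rule that an overflowing \code{int} multiplication yields the low-order 32 bits of the mathematical product and never throws an exception.

```java
public class ImulSemantics {

    // Hypothetical reference helper (not part of JOP): compute the full
    // 64-bit product and keep only the low-order 32 bits.
    static int imulReference(int a, int b) {
        long wide = (long) a * (long) b;
        return (int) wide; // narrowing cast discards the upper 32 bits
    }

    public static void main(String[] args) {
        int[][] operands = {
            {3, 4},
            {-3, 4},
            {-3, -4},
            {Integer.MAX_VALUE, 2},  // overflows, wraps around silently
            {Integer.MIN_VALUE, -1}, // overflows, wraps around silently
            {65536, 65536}           // 2^32, low 32 bits are all zero
        };
        for (int[] op : operands) {
            int product = op[0] * op[1]; // compiled to bytecode imul
            boolean same = product == imulReference(op[0], op[1]);
            System.out.println(op[0] + " * " + op[1] + " = " + product
                    + " (matches reference: " + same + ")");
        }
    }
}
```

Any replacement for \code{imul}, whether in hardware, microcode or Java, has to reproduce exactly this wrap-around behavior, so a helper of this kind could serve as a reference when comparing the implementations.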
\paragraph{Sequential Booth Multiplier in VHDL}

\begin{lstlisting}[float, caption={Booth multiplier in VHDL},
language=VHDL, label=lst:arch:hwsw:vhdl]
process(clk, wr_a, wr_b)

    variable count : integer range 0 to width;
    variable pa    : signed(64 downto 0);
    variable a_1   : std_logic;
    alias p        : signed(32 downto 0) is pa(64 downto 32);

begin
    if rising_edge(clk) then
        if wr_a='1' then
            p := (others => '0');
            pa(width-1 downto 0) := signed(din);
        elsif wr_b='1' then
            b <= din;
            a_1 := '0';
            count := width;
        else
            if count > 0 then
                case std_ulogic_vector'(pa(0), a_1) is
                    when "01" =>
                        p := p + signed(b);
                    when "10" =>
                        p := p - signed(b);
                    when others =>
                        null;
                end case;
                a_1 := pa(0);
                pa := shift_right(pa, 1);
                count := count - 1;
            end if;
        end if;
    end if;
    dout <= std_logic_vector(pa(31 downto 0));
end process;
\end{lstlisting}
%
Listing~\ref{lst:arch:hwsw:vhdl} shows the VHDL code of the
multiplier. Two microcode instructions are used to access this
function: \code{stmul} stores the two operands (from TOS and TOS-1)
and starts the sequential multiplier. After 33 cycles, the result is
loaded with \code{ldmul}. Listing~\ref{lst:arch:hwsw:micro} shows the
microcode for \code{imul}.

\begin{lstlisting}[float, caption={Microcode to access the Booth
multiplier}, label=lst:arch:hwsw:micro]
imul:
        stmul       // store both operands and start
        pop         // pop second operand

        ldi 5       // 6*5+3 cycles wait
imul_loop:          // wait loop
        dup
        nop
        bnz imul_loop
        ldi -1      // decrement in branch slot
        add

        pop         // remove counter
        ldmul nxt   // load result
\end{lstlisting}

\paragraph{Multiplication in Microcode}

If we run out of resources in the FPGA, we can move the function to
microcode. The implementation of \code{imul} is almost identical to
the Java code in Listing~\ref{lst:arch:hwsw:java} and needs 73
microcode instructions.

\paragraph{Bytecode imul in Java}

Microcode is stored in an embedded memory block of the FPGA, which is
itself a limited resource of the FPGA.
We can move the code to external memory by implementing \code{imul}
in Java bytecode. Bytecodes not implemented in microcode result in a
static Java method call from a special class
(\code{com.jopdesign.sys.JVM}). This class has prototypes for each
bytecode, ordered by the bytecode value. This allows us to find the
right method by indexing the method table with the value of the
bytecode. Listing~\ref{lst:arch:hwsw:java} shows the Java method for
\code{imul}. The additional overhead of this implementation is a call
and return with cache refills.

\begin{lstlisting}[float, caption={Implementation of bytecode
\code{imul} in Java}, label=lst:arch:hwsw:java]
public static int imul(int a, int b) {

    int c, i;
    boolean neg = false;
    // work on magnitudes and remember the sign of the result
    if (a<0) {
        neg = true;
        a = -a;
    }
    if (b<0) {
        neg = !neg;
        b = -b;
    }
    c = 0;
    // shift-and-add, starting with the most significant bit of a
    for (i=0; i<32; ++i) {
        c <<= 1;
        if ((a & 0x80000000)!=0) c += b;
        a <<= 1;
    }
    if (neg) c = -c;
    return c;
}
\end{lstlisting}

\paragraph{Implementations Compared}

\tablename~\ref{tab_arch_hwsw_compared} lists the resource usage and
execution time for the three implementations. Execution time is
measured with both operands negative, which is the worst case for the
software implementations. The implementation in Java is slower than
the microcode implementation because the Java method has to be loaded
from main memory into the bytecode cache.

\begin{table}
    \centering
    \begin{tabular}{ld{2}d{3}d{0}}
    \toprule
     & \cc{Hardware} & \cc{Microcode} & \cc{Time} \\
     & \cc{[LC]} & \cc{[Byte]} & \cc{[Cycle]} \\
    \midrule
    VHDL & 156 & 10 & 35 \\
    Microcode & 0 & 73 & 750 \\
    Java & 0 & 0 & ~2,300 \\
    \bottomrule
    \end{tabular}
    \caption{Different implementations of \code{imul} compared}
    \label{tab_arch_hwsw_compared}
\end{table}

Only a few lines of code have to be changed to select one of the
three implementations. This principle can also be applied to other
expensive bytecodes: e.g.\ \code{idiv}, \code{ishr}, \code{iushr} and
\code{ishl}.
As a result, the resource usage of JOP is highly configurable and can be tailored to the needs of each application. Treating VHDL as a software language allows easy movement of function blocks between hardware and software. @ 1.2 log @Handbook alpha edition @ text @d18 3 a20 3 In Java bytecode \code{imul} performs a 32 bit signed multiplication with a 32 bit result. There are no exceptions on overflow. Since 32 bit single cycle multiplications are far beyond the possibilities of d95 2 a96 2 microcode. The implementation of \code{imul} is almost identical with the Java code in Listing~\ref{lst:arch:hwsw:java} and needs 73 d164 7 a170 7 three implementations. The shown principle can also be applied to other expensive bytecodes: e.g.\ \code{idiv}, \code{ishr}, \code{iushr} and \code{ishl}. As a result, the resource usage of JOP is highly configurable and can be selected for each application according to the needs of the application. Treating VHDL as a software language allows easy movement of function blocks between hardware and software. @ 1.1 log @Handbook update @ text @d15 2 a16 1 possibilities on a relatively expensive function: multiplication. @