head	1.3;
access;
symbols
	bg2_23:1.2
	bg2_22:1.2
	bg2_21:1.2
	bg2_20:1.2
	bg2_16:1.2
	bg2_15:1.2
	bg2_12:1.2
	bg2_07:1.2
	isorc2008_submission:1.2
	handbook_alpha_edition:1.2;
locks; strict;
comment	@% @;


1.3
date	2008.08.19.16.19.57;	author martin;	state Exp;
branches;
next	1.2;
commitid	599148aaf29f4567;

1.2
date	2007.11.05.12.43.28;	author martin;	state Exp;
branches;
next	1.1;
commitid	50ee472f0fbc4567;

1.1
date	2007.08.25.17.58.55;	author martin;	state Exp;
branches;
next	;
commitid	c3646d06dd24567;


desc
@@


1.3
log
@Corrections as suggested by Trevor
@
text
@
\section{HW/SW Codesign}
\label{sec:hwsw:co}

Using a hardware description language and loading the design in an
FPGA the former strict border between hardware and software gets
blurred. Is configuring an FPGA not more like loading a program for
execution?

This looser distinction makes it possible to move functions easily
between hardware and software resulting in a highly configurable
design. If speed is an issue, more functions are realized in
hardware. If cost is the primary concern these functions are moved
to software and a smaller FPGA can be used. Let us examine these
possibilities on a relatively expensive function:
\emph{multiplication}.

Bytecode \code{imul} performs a 32 bit signed multiplication with a
32 bit result. There are no exceptions on overflow. Since 32 bit
single cycle multiplications are far beyond the possibilities of
current, mainstream FPGAs the first solution is a sequential
multiplier.

\paragraph{Sequential Booth Multiplier in VHDL}

\begin{lstlisting}[float, caption={Booth multiplier in VHDL},
language=VHDL, label=lst:arch:hwsw:vhdl]
    process(clk, wr_a, wr_b)

        variable count  : integer range 0 to width;
        variable pa     : signed(64) downto 0);
        variable a_1    : std_logic;
        alias p         : signed(32 downto 0)
                          is pa(64 downto 32);

    begin
        if rising_edge(clk) then
            if wr_a='1' then
                p := (others => '0');
                pa(width-1 downto 0) := signed(din);

            elsif wr_b='1' then
                b <= din;
                a_1 := '0';
                count := width;
            else
                if count > 0 then
                    case std_ulogic_vector'(pa(0), a_1) is
                        when "01" =>
                            p := p + signed(b);
                        when "10" =>
                            p := p - signed(b);
                        when others =>
                            null;
                    end case;
                    a_1 := pa(0);
                    pa := shift_right(pa, 1);
                    count := count - 1;
                end if;
            end if;
        end if;
        dout <= std_logic_vector(pa(31 downto 0));
    end process;
\end{lstlisting}
%
Listing~\ref{lst:arch:hwsw:vhdl} shows the VHDL code of the
multiplier. Two microcode instructions are used to access this
function: \code{stmul} stores the two operands (from TOS and TOS-1)
and starts the sequential multiplier. After 33 cycles, the result is
loaded with \code{ldmul}. Listing~\ref{lst:arch:hwsw:micro} shows
the microcode for \code{imul}.

\begin{lstlisting}[float, caption={Microcode to access the Booth multiplier},
label=lst:arch:hwsw:micro]
    imul:
            stmul       // store both operands and start
            pop         // pop second operand

            ldi 5       // 6*5+3 cycles wait
imul_loop:              // wait loop
            dup
            nop
            bnz imul_loop
            ldi -1      // decrement in branch slot
            add

            pop         // remove counter

            ldmul   nxt // load result
\end{lstlisting}

\paragraph{Multiplication in Microcode}

If we run out of resources in the FPGA, we can move the function to
microcode. The implementation of \code{imul} is almost identical to
the Java code in Listing~\ref{lst:arch:hwsw:java} and needs 73
microcode instructions.

\paragraph{Bytecode imul in Java}

Microcode is stored in an embedded memory block of the FPGA. This is
also a resource of the FPGA. We can move the code to external memory
by implementing \code{imul} in Java bytecode. Bytecodes not
implemented in microcode result in a static Java method call from a
special class (\code{com.jopdesign.sys.JVM}). This class has
prototypes for each bytecode ordered by the bytecode value. This
allows us to find the right method by indexing the method table with
the value of the bytecode. Listing~\ref{lst:arch:hwsw:java} shows
the Java method for \code{imul}. The additional overhead for this
implementation is a call and return with cache refills.


\begin{lstlisting}[float, caption={Implementation of bytecode \code{imul} in Java},
label=lst:arch:hwsw:java]
    public static int imul(int a, int b) {

        int c, i;
        boolean neg = false;
        if (a<0) {
            neg = true;
            a = -a;
        }
        if (b<0) {
            neg = !neg;
            b = -b;
        }
        c = 0;
        for (i=0; i<32; ++i) {
            c <<= 1;
            if ((a & 0x80000000)!=0) c += b;
            a <<= 1;
        }
        if (neg) c = -c;
        return c;
    }
\end{lstlisting}

\paragraph{Implementations Compared}

\tablename~\ref{tab_arch_hwsw_compared} lists the resource usage and
execution time for the three implementations. Execution time is
measured with both operands negative, the worst-case execution time
for the software implementations. The implementation in Java is
slower than the microcode implementation as the Java method is
loaded from main memory into the bytecode cache.

\begin{table}
    \centering
    \begin{tabular}{ld{2}d{3}d{0}}
    \toprule
    & \cc{Hardware} & \cc{Microcode} & \cc{Time} \\
    & \cc{[LC]} & \cc{[Byte]} & \cc{[Cycle]} \\
    \midrule
    VHDL & 156 & 10 & 35 \\
    Microcode & 0 & 73 & 750 \\
    Java & 0 & 0 & ~2,300 \\
    \bottomrule
    \end{tabular}
    \caption{Different implementations of \code{imul} compared}
    \label{tab_arch_hwsw_compared}
\end{table}

Only a few lines of code have to be changed to select one of the
three implementations. This principle can also be applied to other
expensive bytecodes: e.g.\ \code{idiv}, \code{ishr}, \code{iushr} and
\code{ishl}. As a result, the resource usage of JOP is highly
configurable and can be selected for each application according to
the needs of the application. Treating VHDL as a software language
allows easy movement of function blocks between hardware and
software.
@


1.2
log
@Handbook alpha edition
@
text
@d18 3
a20 3
In Java bytecode \code{imul} performs a 32 bit signed multiplication
with a 32 bit result. There are no exceptions on overflow. Since 32
bit single cycle multiplications are far beyond the possibilities of
d95 2
a96 2
microcode. The implementation of \code{imul} is almost identical
with the Java code in Listing~\ref{lst:arch:hwsw:java} and needs 73
d164 7
a170 7
three implementations. The shown principle can also be applied to
other expensive bytecodes: e.g.\ \code{idiv}, \code{ishr},
\code{iushr} and \code{ishl}. As a result, the resource usage of JOP
is highly configurable and can be selected for each application
according to the needs of the application. Treating VHDL as a
software language allows easy movement of function blocks between
hardware and software.
@


1.1
log
@Handbook update
@
text
@d15 2
a16 1
possibilities on a relatively expensive function: multiplication.
@