head	1.4;
access;
symbols;
locks; strict;
comment	@% @;


1.4
date	2007.09.14.00.29.47;	author martin;	state dead;
branches;
next	1.3;
commitid	2b6346e9d5f94567;

1.3
date	2007.03.25.01.07.53;	author martin;	state Exp;
branches;
next	1.2;
commitid	a1a4605cb4f4567;

1.2
date	2005.12.20.13.20.53;	author martin;	state Exp;
branches;
next	1.1;
commitid	260d43a805324567;

1.1
date	2005.11.28.15.39.45;	author martin;	state Exp;
branches;
next	;
commitid	3083438b24bf4567;


desc
@@


1.4
log
@moved to JOP handbook
@
text
@\documentclass[a4paper,12pt]{scrartcl}
\usepackage{pslatex} % -- times instead of computer modern

\usepackage[colorlinks=true,linkcolor=black,citecolor=black]{hyperref}
\usepackage{booktabs}
\usepackage{graphicx}

\usepackage[latin1]{inputenc}

\newcommand{\code}[1]{{\textsf{#1}}}
\newcommand{\sign}[1]{{\texttt{#1}}}


\begin{document}

\title{SimpCon -- a Simple SoC Interconnect\\Draft}
\author{Martin Schoeberl\\ martin@@jopdesign.com}
\maketitle \thispagestyle{empty}

\begin{abstract}

This document proposes a simple interconnection standard for
system-on-chip (SoC) components. It is intended to provide pipelined
access to devices such as on-chip peripherals and on-chip memory
controllers with minimal hardware resources.


\end{abstract}

\section{Introduction}

The intention of the following SoC interconnect standard is to be
simple and efficient with respect to implementation resources and
transaction latency.

SimpCon is a fully synchronous standard for on-chip
interconnections. It is a point-to-point connection between a master
and a slave. The master starts either a read or write transaction.
Master commands are single cycle to free the master to continue on
internal operations during an outstanding transaction. The slave has
to register the address when needed for more than one cycle. The
slave also registers the data on a read and provides it to the
master for more than a single cycle. This property allows the master
to delay the actual read if it is busy with internal operations.

The slave signals the end of the transaction through a novel
\emph{ready counter} to provide an early notification. This early
notification simplifies the integration of peripherals into
pipelined masters.

Slaves can also provide several levels of pipelining. This feature
is announced by two static output ports (one for the read and one for
the write pipeline level).

Off-chip connections (e.g.\ main memory) are device specific and
need a slave to perform the translation. Peripheral interrupts are
not covered by this specification.

\subsection{Features}

\begin{itemize}
    \item Master/slave point-to-point connection
    \item Synchronous operation
    \item Read and write transactions
    \item Early pipeline release for the master
    \item Pipelined transactions
    \item Open-source specification
    \item Low implementation overheads
\end{itemize}

\subsection{Basic Read Transaction}

Figure~\ref{fig:sc:basic:rd} shows a basic read transaction for a
slave with one cycle latency. The acknowledge signals are omitted
from the figure. In the first cycle, the address phase, the
\sign{rd} signals the slave to start the read transaction. The
address is registered by the slave. During the following cycle, the
read phase, the slave performs the read and registers the data. Due
to the register in the slave the data is available in the third
cycle, the result phase. To simplify the master, the read data stays
valid until the response to the next read request.

\begin{figure}
    \centering
    \includegraphics{figures/sc_basic_rd}
    \caption{Basic read transaction}
    \label{fig:sc:basic:rd}
\end{figure}
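
As a rough illustration of the three phases, the following Python
sketch models a slave with one cycle latency at the clock-cycle level.
The class \code{OneCycleSlave}, its \code{memory} dictionary and the
\code{clock} method are hypothetical names chosen for this example and
are not part of the specification. Each call to \code{clock} stands for
one rising clock edge and returns \sign{rd\_data} as the master sees it
in the following cycle.

```python
class OneCycleSlave:
    """Cycle-level model of a SimpCon slave with one cycle read latency.

    Hypothetical illustration: names and structure are assumptions,
    not part of the SimpCon specification.
    """

    def __init__(self, memory):
        self.memory = memory   # backing store: word address -> value
        self.addr_reg = None   # address registered in the address phase
        self.rd_data = 0       # registered read data (output port)

    def clock(self, rd=False, address=0):
        """Apply one rising clock edge with the given master inputs."""
        # End of the read phase: latch the data for the address that
        # was registered one edge earlier.
        if self.addr_reg is not None:
            self.rd_data = self.memory[self.addr_reg]
            self.addr_reg = None
        # End of the address phase: register the address if rd is set.
        if rd:
            self.addr_reg = address
        return self.rd_data


slave = OneCycleSlave({4: 0xCAFE})
trace = [
    slave.clock(rd=True, address=4),  # edge after cycle 1: address registered
    slave.clock(),                    # edge after cycle 2: data registered
    slave.clock(),                    # later edges: rd_data keeps its value
]
```

The second entry of \code{trace} is the value visible in cycle 3, the
result phase, and it stays unchanged because no new read was started.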

\subsection{Basic Write Transaction}

A write transaction consists of a single cycle address/command phase
started by assertion of \sign{wr} where the address and the write
data are valid. \sign{address} and \sign{wr\_data} are usually
registered by the slave. The end of the write cycle is signalled to
the master by the slave with \sign{rdy\_cnt}. See Section~\ref{sec:ack}
and the example in Figure~\ref{fig:sc:wr:ws}.

\section{SimpCon Signals}

This section defines the signals used by the SimpCon connection.
Some of the signals are optional and may not be present on a
peripheral device.

All signals are unidirectional point-to-point connections between
a master and a slave. The signal details are described by the device
that drives the signal. Table~\ref{tab:sc:signals} lists the signals
that define the SimpCon interface. The column Direction indicates
whether the signal is driven by the master or the slave.

\begin{table}
    \centering

    \begin{tabular}{lrlll}
        \toprule
        Signal & Width & Direction & Required & Description \\
        \midrule
        \sign{address} & 1--32 & Master & No & Address lines from the
        master\\
        & & & & to the slave port\\
        \sign{wr\_data} & 32 & Master & No & Data lines from the
        master\\
        & & & & to the slave port\\
        \sign{rd} & 1 & Master & No & Start of a read transaction \\
        \sign{wr} & 1 & Master & No & Start of a write transaction \\
        \sign{rd\_data} & 32 & Slave & No & Data lines from the
        slave\\
        & & & & to the master port\\
        \sign{rdy\_cnt} & 2 & Slave & Yes & Transaction end signalling \\
        \sign{rd\_pipeline\_level} & 2 & Slave & No & Maximum pipeline
        level\\
        & & & & for read transactions \\
        \sign{wr\_pipeline\_level} & 2 & Slave & No & Maximum pipeline
        level\\
        & & & & for write transactions \\
        \bottomrule

    \end{tabular}
    \caption{SimpCon port signals}
    \label{tab:sc:signals}

\end{table}


\subsection{Master Signal Details}

This section describes the signals that are driven by the master to
initiate a transaction.

\subsubsection{address}

Master addresses represent word addresses as offsets in the slave's
address range. \sign{address} is valid for a single cycle, either with
\sign{rd} for a read transaction or with \sign{wr} and
\sign{wr\_data} for a write transaction.

The number of bits for \sign{address} depends on the slave's address
range. For a single port slave \sign{address} can be omitted.

\subsubsection{wr\_data}

The \sign{wr\_data} signals carry the data for a write transaction.
They are valid for a single cycle together with \sign{address} and
\sign{wr}. The signal is typically 32 bits wide. Slaves can ignore
the upper bits when the slave port is narrower than 32 bits.

\subsubsection{rd}

The \sign{rd} signal is asserted for a single clock cycle to start a
read transaction. \sign{address} has to be valid in the same cycle.

\subsubsection{wr}

The \sign{wr} signal is asserted for a single clock cycle to start a
write transaction. \sign{address} and \sign{wr\_data} have to be
valid in the same cycle.

\subsubsection{sel\_byte}

The \sign{sel\_byte} signal is reserved for future versions of the
SimpCon specification to add individual byte enables.

\subsection{Slave Signal Details}

This section describes the signals that are driven by the slave in
response to transactions initiated by the master.

\subsubsection{rd\_data}

The \sign{rd\_data} signals carry the result of a read transaction.
The data is valid when \sign{rdy\_cnt} reaches 0 and stays valid
until a new read result is available. The signal is typically 32 bits
wide. Slaves that provide fewer than 32 bits should pad the upper
bits with 0.

\subsubsection{rdy\_cnt}

The \sign{rdy\_cnt} signal provides the number of cycles until the
pending transaction finishes. A 0 means that either read data is
available or a write transaction has finished. Values of 1 and
2 mean that the transaction will finish in at least 1 or 2 cycles.
The maximum value is 3 and means that the transaction will finish in
3 or \emph{more} cycles. Note that not all values have to be used in
a transaction. Any monotonic sequence of \sign{rdy\_cnt} values is
legal.
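
The encoding can be sketched in a few lines of Python. The function
names \code{encode\_rdy\_cnt} and \code{is\_legal\_sequence} are
hypothetical, and the check reflects one possible reading of
``monotonic'': the values of a single transaction never increase and
the transaction ends with 0.

```python
def encode_rdy_cnt(remaining_cycles):
    """Saturate the number of remaining cycles to the 2-bit rdy_cnt range."""
    if remaining_cycles < 0:
        raise ValueError("remaining_cycles must be non-negative")
    return min(remaining_cycles, 3)  # 3 stands for "3 or more"


def is_legal_sequence(values):
    """Check the rdy_cnt trace of one transaction.

    Assumed interpretation: every value fits in two bits, the sequence
    never increases, and it ends at 0.
    """
    if not values or values[-1] != 0:
        return False
    if any(v < 0 or v > 3 for v in values):
        return False
    return all(a >= b for a, b in zip(values, values[1:]))
```

Under this reading a slave may hold the value 3 for several cycles or
jump from 3 directly to 0, but it may not count upwards during a
transaction.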

\subsubsection{rd\_pipeline\_level}

The static \sign{rd\_pipeline\_level} provides the master with the
read pipeline level of the slave. The signal has to be constant to
enable the synthesizer to optimize the pipeline level dependent
state machine in the master.


\subsubsection{wr\_pipeline\_level}

The static \sign{wr\_pipeline\_level} provides the master with the
write pipeline level of the slave. The signal has to be constant to
enable the synthesizer to optimize the pipeline level dependent
state machine in the master.

\section{Slave Acknowledge}
\label{sec:ack}

Flow control between the slave and the master is usually done by a
single signal in the form of \emph{wait} or \emph{acknowledge}. The
\sign{ack} signal, e.g.\ in the Wishbone specification, is set when
the data is available or the write operation has finished. However,
for a pipelined master it is useful to know \emph{earlier} when a
transaction will finish.

For many slaves, e.g.\ an SRAM interface with fixed wait states,
this information is available inside the slave. In the SimpCon
interface this information is communicated to the master through the
two bit signal \sign{rdy\_cnt}. \sign{rdy\_cnt} signals the number
of cycles till the read data will be available or the write
transaction will be finished. Value 0 is equivalent to an \emph{ack}
signal and 1, 2, and 3 are equivalent to a wait request with the
distinction that the master knows how long the wait request will
last.

To avoid too many signals at the interconnect, \sign{rdy\_cnt} has a
width of two bits. Therefore, the maximum value of 3 has the special
meaning that the transaction will finish in 3 or \emph{more} cycles.
As a result, the master can only use the values 0, 1, and 2 to
release actions in its pipeline.

Idle slaves will keep the former value of 0 for \sign{rdy\_cnt}.
Slaves that do not know in advance how many wait states are needed
for the transaction can produce sequences that omit any of the
numbers 3, 2, and 1. The master has to handle these situations.

Figure~\ref{fig:sc:rd:ws} shows an example of a slave that needs
three cycles for the read to be processed. In cycle 1 the read
command and the address are set by the master. The slave registers
the address and sets \sign{rdy\_cnt} to 3 in cycle 2. The read takes
three cycles (2--4) during which \sign{rdy\_cnt} is decremented. In
cycle 4 the data is available inside the slave and gets registered.
It is available in cycle 5 for the master and \sign{rdy\_cnt} is
finally 0. Both \sign{rd\_data} and \sign{rdy\_cnt} keep their
values until a new transaction is requested.

\begin{figure}
    \centering
    \includegraphics{figures/sc_rd_ws}
    \caption{Read transaction with wait states}
    \label{fig:sc:rd:ws}
\end{figure}
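
The \sign{rdy\_cnt} values of this read can be reproduced with a small
Python sketch. The helper \code{rdy\_cnt\_trace} is a hypothetical name;
it assumes an idle slave that knows its fixed number of wait cycles and
counts down by one in every cycle.

```python
def rdy_cnt_trace(wait_cycles):
    """Return rdy_cnt for each cycle, starting with the command cycle.

    Assumptions: the slave is idle (rdy_cnt 0) in the command cycle,
    knows its fixed number of wait cycles, and counts down every cycle.
    """
    trace = [0]  # cycle 1: command issued, slave still reports 0
    for remaining in range(wait_cycles, 0, -1):
        trace.append(min(remaining, 3))  # saturate to two bits
    trace.append(0)  # read data valid or write finished
    return trace


trace = rdy_cnt_trace(3)          # cycles 1..5 of the read example
data_cycle = len(trace)           # cycle in which rdy_cnt is 0 again
early_cycle = trace.index(1) + 1  # cycle with one cycle of advance notice
```

For three wait cycles the trace is 0, 3, 2, 1, 0 for cycles 1 to 5, so
a pipelined master sees the value 1 in cycle 4, one cycle before the
data can be read.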


Figure~\ref{fig:sc:wr:ws} shows an example of a slave that needs
three cycles for the write to be processed. The address, the data to
be written and the write command are valid during cycle 1. The slave
registers the address and write data during cycle 1 and performs the
write operation during cycles 2--4. The \sign{rdy\_cnt} is
decremented and a non-pipelined slave can accept a new command after
cycle 4.

\begin{figure}
    \centering
    \includegraphics{figures/sc_wr_ws}
    \caption{Write transaction with wait states}
    \label{fig:sc:wr:ws}
\end{figure}


\section{Pipelining}

Figure~\ref{fig:sc:pipe:level} shows a read transaction for a slave
with four cycles latency. Without any pipelining the next read
transaction will start in cycle 7 after the data from the former
read transaction is read by the master. The three bottom lines show
when new read transactions will be started for different pipeline
levels. With pipeline level 1 a new transaction can start in the
same cycle when the former read data is available (in this example
in cycle 6). Higher levels mean that the next read will start
earlier, as shown for levels 2 and 3.

\begin{figure}
    \centering
    \includegraphics[width=\textwidth]{figures/sc_pipe_level}
    \caption{Different pipeline levels for a read transaction}
    \label{fig:sc:pipe:level}
\end{figure}
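
The relation between pipeline level and the start of the next read can
be written as a one-line rule. The function \code{next\_read\_cycle} is
a hypothetical helper; it assumes that every additional level moves the
next command exactly one cycle earlier, which matches the cycles named
above for levels 0 and 1.

```python
def next_read_cycle(data_cycle, pipeline_level):
    """Earliest cycle in which the next read command may be issued.

    data_cycle: cycle in which the former read data is available.
    pipeline_level: 0..3, as announced by rd_pipeline_level.
    Assumption: each level moves the start one cycle earlier.
    """
    if pipeline_level not in (0, 1, 2, 3):
        raise ValueError("pipeline level must be in the range 0..3")
    return data_cycle + 1 - pipeline_level


# Example from the text: the read data is available in cycle 6.
starts = {level: next_read_cycle(6, level) for level in range(4)}
```

With the data available in cycle 6, the sketch yields cycle 7 without
pipelining and cycles 6, 5, and 4 for levels 1, 2, and 3.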

Implementation of level 1 in the slave is trivial (just two more
transitions in the state machine). It is recommended to provide
level 1 at least for read transactions. Level 2 is a little bit more
complex but usually no additional address or data registers are
needed.

To implement level 3 pipelining in the slave at least an additional
address register is needed. However, to use level 3 the master has
to issue the request in the same cycle as \sign{rdy\_cnt} goes to 2.
That means this transition is combinatorial. We see in
Figure~\ref{fig:sc:pipe:level} that \sign{rdy\_cnt} value of 3 means
three or more cycles till the data is available and can therefore
not be used to trigger a new transaction.

\section{Multiple Masters}

SimpCon defines no signals for the communication between a master
and an arbiter. However, it is possible to build a multi-master
system with SimpCon. The SimpCon interface can be used as the
interconnect both between the masters and the arbiter and between
the arbiter and the slaves. In this case the arbiter acts as slave
for the master and as master for the peripheral devices.

The missing arbitration protocol in SimpCon results in the need to
queue $n-1$ requests in an arbiter for $n$ masters. However, in
exchange for this additional hardware the bus request has zero
overhead: the master that gets the bus starts the slave transaction
in the same cycle.
\\
\\
TODO: add a timing diagram to explain this concept.
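
Until a timing diagram is available, the queueing requirement can be
illustrated with a toy model in Python. The class \code{Arbiter} is
hypothetical and strongly simplified: it assumes that every master has
at most one outstanding request, that the slave accepts one command per
cycle, and it ignores \sign{rdy\_cnt}.

```python
from collections import deque


class Arbiter:
    """Toy arbiter: forwards one request per cycle and queues the rest.

    Hypothetical model, not part of the specification. Assumes at most
    one outstanding request per master and a slave that accepts one
    command per cycle.
    """

    def __init__(self):
        self.pending = deque()  # requests waiting for the slave
        self.max_depth = 0      # largest queue depth observed

    def cycle(self, requests):
        """Process one cycle.

        requests: ids of the masters issuing a command in this cycle.
        Returns the id of the master whose command is forwarded in the
        same cycle, or None if nothing is pending.
        """
        self.pending.extend(requests)
        granted = self.pending.popleft() if self.pending else None
        self.max_depth = max(self.max_depth, len(self.pending))
        return granted


arb = Arbiter()
grants = [arb.cycle([0, 1, 2]), arb.cycle([]), arb.cycle([]), arb.cycle([])]
```

With three masters requesting in the same cycle, one command is
forwarded immediately and two are queued, which corresponds to the
$n-1$ entries mentioned above.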


\section{Examples}

This section provides some examples for the application of the
SimpCon definition.

\subsection{IO Port}

TODO: Show how simple an IO port can be with SimpCon. We need no
addresses and can tie \sign{rdy\_cnt} to 0. We only need the
\sign{rd} or \sign{wr} signal to enable the port.

\subsection{SRAM interface}

The following example is taken from an implementation of SimpCon for
a Java processor. The processor is clocked at 100\,MHz and the main
memory consists of 15\,ns static RAMs. Therefore the minimum access
time for the RAM is two cycles. The slack time of 5\,ns forces us to
use output registers for the RAM address and write data and input
registers for the read data in the IO cells of the FPGA. These
registers fit nicely with the intention of SimpCon to use registers
inside the slave.
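
The cycle count and the slack follow directly from the clock period.
As a short worked calculation, with the symbols $T_{clk}$,
$n_{cycles}$ and $t_{slack}$ introduced only for this illustration:
\[
    T_{clk} = \frac{1}{100\,\mathrm{MHz}} = 10\,\mathrm{ns}, \qquad
    n_{cycles} = \left\lceil \frac{15\,\mathrm{ns}}{10\,\mathrm{ns}} \right\rceil = 2, \qquad
    t_{slack} = 2 \cdot 10\,\mathrm{ns} - 15\,\mathrm{ns} = 5\,\mathrm{ns}.
\]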

Figure~\ref{fig:sc:sram} shows the interface for a non-pipelined
read access followed by a write access. Four signals are driven by
the master and two signals by the slave. The lower half of the figure
shows the signals at the FPGA pins where the RAM is connected.

\begin{figure}
    \centering
    \includegraphics[width=\textwidth]{figures/sc_sram}
    \caption{Static RAM interface without pipelining}
    \label{fig:sc:sram}
\end{figure}

In cycle 1 the read transaction is started by the master and the
slave registers the address. The slave also sets the registered
control signals \sign{ncs} and \sign{noe} during cycle 1. Due to the
IO cell registers, the address and control signals are valid at the
FPGA pins very early in cycle 2. At the end of cycle 3 (15\,ns after
\sign{address}, \sign{ncs} and \sign{noe} are stable) the data from
the RAM is available and can be sampled with the rising edge for
cycle 4.

The master reads the data in cycle 4 and starts a write transaction
in cycle 5. Address and data are again registered by the slave and
are available for the RAM at the beginning of cycle 6. To perform a
write in two cycles the \sign{nwr} signal is registered by a
negative-edge-triggered flip-flop.

Figure~\ref{fig:sc:sram:prd} shows a pipelined read from the RAM
with pipeline level 2. With this pipeline level and the two cycles
read access time of the RAM we get the maximum bandwidth possible.

\begin{figure}
    \centering
    \includegraphics[width=\textwidth]{figures/sc_sram_prd}
    \caption{Pipelined read from a static RAM}
    \label{fig:sc:sram:prd}
\end{figure}

We can see the start of the second read transaction in cycle 3
during the read of the first data from the RAM. The new address is
registered in the same cycle and available for the RAM in the
following cycle 4. Although we have a pipeline level of 2 we need no
additional address or data register. The read data is available for
two cycles (\sign{rdy\_cnt} 2 or 1 for the next read) and the master
is free to select one of the two cycles to read the data.

\subsection{Master Multiplexing}

To add several slaves to a single master, the \sign{rd\_data} and
\sign{rdy\_cnt} signals have to be multiplexed. Because all
\sign{rd\_data} signals are registered by the slaves, a single
pipeline stage will be enough for a large multiplexer. The selection
of the multiplexer is known at the transaction start but needed
at the earliest in the next cycle. Therefore it can be registered to
further speed up the multiplexer.
\\
\\
TODO: add a schematic for the master \sign{rd\_data} multiplexer.
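
As a placeholder for the schematic, the registered multiplexer select
can be sketched in Python. The class \code{ReadMux} and its method
names are hypothetical; the sketch assumes that the slave select is
latched in the cycle in which a transaction starts and drives the
multiplexer from the next cycle on.

```python
class ReadMux:
    """Multiplex rd_data and rdy_cnt of several slaves to one master.

    Hypothetical sketch: the slave select is registered at transaction
    start and drives the multiplexer from the following cycle on.
    """

    def __init__(self):
        self.sel_reg = 0  # registered index of the selected slave

    def outputs(self, rd_data, rdy_cnt):
        """Multiplexer outputs for the current cycle.

        rd_data, rdy_cnt: lists with one entry per slave.
        """
        return rd_data[self.sel_reg], rdy_cnt[self.sel_reg]

    def clock(self, start=False, sel=0):
        """Rising clock edge: latch the select when a transaction starts."""
        if start:
            self.sel_reg = sel


mux = ReadMux()
mux.clock(start=True, sel=1)                # transaction to slave 1 starts
selected = mux.outputs([0x11, 0x22], [0, 1])
```

Because the select is taken from a register, only the multiplexer
itself lies in the path from the slave output registers to the master.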

\section{Status}

\begin{itemize}
    \item First timing diagrams drawn
    \item SimpCon SRAM interface for JOP on Cyclone and Spartan-3 is
    available
    \item Project at opencores.org accepted
    \item Simple UART as SimpCon example
    \item IO in JOP changed to SimpCon (uart, cnt, usb)
\end{itemize}
%
Next steps:
%
\begin{itemize}
    \item Continue this document
    \item Provide Wishbone bridges
\end{itemize}
%
to clarify:
\begin{itemize}
    \item Use transaction or transfer in this document?
    \item Use address phase or better command cycle?
\end{itemize}

%\end{document}


\section{Notes}

\subsection{Group comment}
\begin{verbatim}

After implementing the Wishbone interface for main memory access
from JOP I see several issues with the Wishbone specification that
make it not the best choice for SoC interconnect.

The Wishbone interface specification is still in the tradition of
microcomputer or backplane busses. However, for a SoC interconnect,
which is usually point-to-point, this is not the best approach.

The master is requested to hold the address and data valid through
the whole read or write cycle. This complicates the connection to a
master that has the data valid only for one cycle. In this case the
address and data have to be registered *before* the Wishbone
connect or an expensive (time and resources) MUX has to be used. A
register results in one additional cycle latency. A better approach
would be to register the address and data in the slave. Then there
is also time to perform address decoding in the slave (before the
address register).

There is a similar issue for the output data from the slave: As it
is only valid for a single cycle it has to be registered by the
master when the processor is not reading it immediately. Therefore,
the slave should keep the last valid data at its output even when
wb.stb is not asserted anymore (which is no issue from the
hardware complexity).

The Wishbone connection for JOP resulted in an unregistered Wishbone
memory interface and registers for the address and data in the
Wishbone master. However, for fast address and control output
(tco) and short setup time (tsu) we want to place the
registers in the IO-pads of the FPGA. With the registers buried
in the WB master it takes some effort to set the right constraints
for the Synthesizer to implement such IO-registers.

The same issue is true for the control signals. The translation from
the wb.cyc, wb.stb and wb.we signals to
ncs, noe and nwe for the SRAM are on the
critical path.

The ack signal is too late for a pipelined master. We would
need to know it *earlier* when the next data will be available ---
and this is possible, as we know in the slave when the data from the
SRAM will arrive. A workaround is a non-WB-conforming
early ack signal.

Due to the fact that the data registers are not inside the WB
interface we need an extra WB interface for the Flash/NAND interface
(on the Cyclone board). We cannot afford the address decoding and a
MUX in the data read path without registers. This would result in an
extra cycle for the memory read due to the combinational delay.

In the WB specification (AFAIK) there is no way to perform pipelined
read or write. However, for blocked memory transfers (e.g. cache
load) this is the usual way to get a good performance.

Conclusion -- I would prefer:

    * Address and data (in/out) register in the slave
    * A way to know earlier when data will be available (or
      a write has finished)
    * Pipelining in the slave

As a result of this experience I'm working on a new SoC
interconnect (working name SimpCon) definition that should avoid the
mentioned issues and should still be easy to implement for both
master and slave.

As there are so many projects available that implement the WB
interface I will provide bridges between SimpCon and WB. For IO
devices the former arguments do not apply to that extent as the
pressure for low latency access and pipelining is not high.
Therefore, a bridge to WB IO devices can be a practical solution for
design reuse.
\end{verbatim}

\subsubsection{additional comments}
\begin{verbatim}

The idea for (some) pipeline support is twofold:

1.) The slave will provide more information than a single ack
or wait states. It will (if it is capable of doing so) signal the
number of clock cycles remaining till the read data is available (or
the write has finished) to the master. This feature allows the
pipelined master to prepare for the upcoming read.

2.) If the slave can provide pipelining the master can use
overlapped wr or rd requests. The slave has a static output port
that tells how many pipeline stages are available. I call this
'pipeline level':
    0 means non overlapping
    1 a new rd/wr request can be issued in the same cycle
      when the former data is read.
    2 one earlier and
    3 is the maximum level where you get full pipelining
      on the basic read cycle with one wait state
      (command - read - read - result).


The draft of the spec at the moment is a few sketches on real paper -
it takes some time to draw all diagrams for a document.

I have a first implementation of SimpCon on JOP to test the ideas: A
master in JOP and a slave for SRAM access.
\end{verbatim}

\subsection{e-mail from Robert Finch}

\begin{verbatim}


Hi Martin, I read your comments. I've thought some about the
WISHBONE spec myself.


"Martin Schoeberl" <mschoebe@@mail.tuwien.ac.at> wrote in message
news:<4384f0b3$0$11610$3b214f66@@tunews.univie.ac.at>...
> After implementing the Wishbone interface for main memory access
> from JOP I see several issues with the Wishbone specification that
> makes it not the best choice for SoC interconnect.

> The master is requested to hold the address and data valid through
> the whole read or write cycle. This complicates the connection to a
> master that has the data valid only for one cycle. In this case the
> address and data have to be registered *before* the Wishbone connect
> or an expensive (time and resources) MUX has to be used. A register
> results in one additional cycle latency. A better approach would be
> to register the address and data in the slave. Than there is also
> time to perform address decoding in the slave (before the address
> register).

I've of the opinion that all outputs of masters should be
registered. Registering the outputs hides the timing of the master's
internal signals from the rest of the system and helps turn it into
a 'black box'. However, in my designs I provide both registered and
unregistered versions of outputs, as it is quite handy to have
unregistered signals sometimes. It would have been nice if the
WISHBONE bus spec'd unregistered signals as well as registered ones.
I've just been naming the unregistered signals by including '_nxt'
in the signal name as in 'adr_nxt_o'. '_nxt' standing for the signal
value that will be 'next'.

Why is the MUX needed ?

I've found that a register may indeed result in an additional cycle
of latency, depending on the how the system is put together.
However, I've also found that it doesn't really make any difference
to the performance of the system. Registering the output often
allows the cycle time to be decreased, and the 'lost' cycle of
latency is made up for by better timing. I've also found that the
INTERCON (address decoding, bus muxing logic, and arbitration)
typically requires a full cycle by itself and it's best to have the
signals feeding into the INTERCON already registered. Unless the
system is really small (single master / slave).

By 'address decoding in slaves' I'm assuming you mean partial
address decoding for only register selection. Full address decoding
shouldn't be done in slaves as it wastes a lot of resources. The
address decoding (device/slave selection) should be done by the
INTERCON, and is a function of the system.

Almost always masters are designed to hold address and data valid
until the external system acknowledges the request.

>
> There is a similar issue for the output data from the slave: As it
> is only valid for a single cycle it has to be registered by the
> master when the processor is not reading it immediately. Therefore,
> the slave should keep the last valid data at it's output even when
> wb.stb is not assigned anymore (which is no issue from the hardware
> complexity).

I'm not sure I understand the 'single cycle' timing. Slave devices
I've worked on present valid data as long as the signals coming from
the INTERCON indicate that it should do so. Otherwise the output
data from the slave is allowed to flip around according to whatever
register is addressed as it doesn't affect the system since it's not
muxed to the master's inputs unless it's the addressed device.

Generally, during a read request the master will always be ready to
read data immediately. If it wasn't ready to read the data it
shouldn't have requested it, as this wastes bus bandwidth.

>
> The Wishbone connection for JOP resulted in an unregistered Wishbone
> memory interface and registers for the address and data in the
> Wishbone master. However, for fast address and control output (tco)
> and short setup time (tsu) we want the registers in the IO-pads of
> the FPGA. With the registers buried in the WB master it takes some
> effort to set the right constraints for the Synthesizer to implement
> such IO-registers.
>
> The same issue is true for the control signals. The translation from
> the wb.cyc, wb.stb and wb.we signals to ncs, noe and nwe for the
> SRAM are on the critical path.

I've come to the conclusion that it's unrealistic to expect that
external memory can be accessed at a high rate using only a single
clock cycle. There is naturally a multi-cycle latency when dealing
with an external device operating a high clock rate. The registered
outputs of a WISHBONE master typically wouldn't need to be
registered at the IO-pads.

> The ack signal is too late for a pipelined master. We would need to
> know it *earlier* when the next data will be available --- and this
> is possible, as we know in the slave when the data from the SRAM
> will arrive. A work around solution is a non-WB-conforming early ack
> signal.

I ran into this too. I built a system similar to this and it worked
okay. But, I decided not to build newer systems this way. A problem
is that the latency of external device may vary. This makes it
difficult to pipeline the master. SRAM may have a latency of three
cycles, BRAM two cycles, and IO-devices a single cycle. My (current)
master already has an internal three stage pipeline, adding three
more pipeline stages for memory would turn it into a six stage
monster.

>
> Due to the fact that the data registers not inside the WB interface
> we need an extra WB interface for the Flash/NAND interface (on the
> Cyclone board). We cannot afford the address decoding and a MUX in
> the data read path without registers. This would result in an extra
> cycle for the memory read due to the combinational delay.
>
Yes.  Can the delay be hidden using mult-masters (later) ?

> In the WB specification (AFAIK) there is no way to perform pipelined
> read or write.

This is something I've thought was missing from the spec as well.
However, doing pipelined access across a system bus could be quite a
feat.


However, for blocked memory transfers (e.g. cache
> load) this is the usual way to get a good performance.
>
> Conclusion -- I would prefer:
>
>     * Address and data (in/out) register in the slave
>     * A way to know earlier when data will be available (or
>       a write has finished)
>     * Pipelining in the slave
>
> As a result from this experience I'm working on a new SoC
> interconnect (working name SimpCon) definition that should avoid the
> mentioned issues and should be still easy to implement the master
> and slave.
>
> As there are so many projects available that implement the WB
> interface I will provide bridges between SimpCon and WB. For IO
> devices the former arguments do not apply to that extent as the
> pressure for low latency access and pipelining is not high.
> Therefore, a bridge to WB IO devices can be a practical solution for
> design reuse.
>
> A question to the group: What SoC interconnect are you using?
> A standard one for the peripheral devices and a 'home-brewed' for
> more demanding connections (e.g. external RAM access)?
>
> Martin
>

I'm using an 'enhanced' WISHBONE bus (I added one or two signals,
and renamed a couple).

I found that for my systems it wasn't necessary to pipeline the
memory system to get good performance. The reason being that there
are multiple bus masters, and all the memory bandwidth is consumed
anyway. (CPU, VIDEO, AUDIO, SPRITE, DISK, CPU2). I ended up building
a shared memory controller with an arbitrater that allows each
device access only every third cycle. This effectively hides a three
cycle latency though the memory. The external memory can service a
request every single clock cycle (at 40MHz!). (Just not from the
same master) Every cycle one of the masters is selected to be
allowed a memory access. Three cycles later, read data is available
for that master. From the master's perspective it looks like a
normal WISHBONE bus.

Even though the system isn't pipelined, it's using the maximum
amount of performance it can get out of the memory. As a result,
it's turned out that the WISHBONE bus serves as a suitable bus
system to use.

I'm not sure what's included in JOP system (I'm a news-subscriber),
but it may be easier to get better performance by using multiple
CPU's. For example, one cpu could be handling network communcations
while a second is running Java code (JVM). If there is any kind of
VIDEO or audio (eg MP3) that could be handled by another master as
well.


Good Luck with you're bus design.

Robert

\end{verbatim}

\subsection{comp.arch.fpga}

\begin{verbatim}
>> The last days I played around with the Quartus SOPC builder [1].
>> Although I'm more a batch/make guy, I'm impressed by the easy to use
>> tool. In order to scratch a little bit on the dominance of the NIOS II
>> in the SOPC world I wrapped JOP [2] into an Avalon component ;-)
>
> Kudos, that is excellent. Any lessons/gotchas about turning JOP into an
> SOPC components should someone else fancy a similar undertaking?

The Avalon bus is very flexible. Therefore, writing a slave or
master (SOPC component) is not that hard. The magic is in the Avalon
switch fabric generated by the builder. However, an example would
have helped (Altera listening?). I didn't find anything on Altera's
website or with Google. Now a very simple slave can be found at [1].

One thing to take care: When you (like me) like to avoid VHDL files
in the Quartus directory you can easily end up with three copies of
your design files. Can get confusing which one to edit. When you
edit your VHDL file in the component directory (the source for the
SOPC builder) don't forget to rebuild your system. The build process
copies it to your Quartus project directory.

When you want to start over with a clean project the only files
needed for the project are: .qpf, .qsf, .ptf

The master is also easy: just address, read and write data,
read/write and you have to react to waitrequest. See as example the
SimpCon/Avalon bridge at [2]. The Avalon interconnect fabric handles
all bus multiplexing, bus resizing, and control signal translation.

>> However, of course there is some drawback. The performance of the
>> Avalon system is lower than a 'native' connection (or in my case
>> via SimpCon [5]) of the main memory to the CPU. I can provide some
>> numbers if there is interest...
>
> Care to elaborate? I'd expect going over Avalon could add latency, but
> if you can exploit multiple outstanding transactions (aka "posted
> reads") and/or burst transfers, the bandwidth should be the same as
> "native".

Yes, the latency is the issue for JOP. JOP does not trigger several
read or write transactions. However, it can trigger one transaction
and then continue to execute microcode. When the (read) result is
needed, the JOP pipeline is stopped till the result is available.
What helps is to know in advance (one or two cycles) when the result
will be available. That's the trick with the SimpCon interface.
There is not a single ack or waitrequest signal, but a counter that
will say how many cycles it will take to provide the result. In this
case I can restart the pipeline earlier.

Another point is that, in my opinion, the wrong side has to hold data
for more than one cycle. This is true for several busses (e.g. also
Wishbone). For these busses the master has to hold address and write
data till the slave is ready. This is a result from the backplane
bus thinking. In an SoC the slave can easily register those signals
when needed longer and the master can continue. On the other hand,
as JOP continues to execute and it is not so clear when the result
is read, the slave should hold the data when available. That is easy
to implement, but Wishbone and Avalon specify just a single cycle
data valid.

>> BTW: The Cyclone II FPGA cannot be clocked really faster than the
>> Cyclone (just a few %). I hoped to get some speed-up for free due
>> to a new generation FPGA :-(
>
> I was surprised too when I saw that. I gather the only way the Cyclone
> II can gain you speed over Cyclone I is when you can use the embedded
> multipliers. Makes me wonder about the upcoming Cyclone III.

Is there any other data available on that? I did not find many
comments in this group on experiences with Cyclone I and II. Looks
like the CII was more optimized for cost than speed. Yes, waiting
for III ;-)

Martin

[1]
http://www.opencores.org/cvsweb.cgi/~checkout~/jop/sopc/components/avalon_test_slave/hdl/avalon_test_slave.vhd

[2]
http://www.opencores.org/cvsweb.cgi/~checkout~/jop/vhdl/scio/sc2avalon.vhd

Hi Antti,

> most of the SOPC magic happens in the perl package "Europe" AFAIK.
> don't expect a lot of information about the internals of the package.

That's fine for me. When the connection magic happens and I don't
have to care it's fine. OK, one exception: Perhaps I would like to
know more details on the latency. The switch fabric is 'plain' VHDL
or Verilog. However, generated code is very hard to read.

> as a very simple example for avalon master-slave type of peripherals there
> is one free avalon IP core for SD-card support. the core can be found
> at some russian forum and later it was also added to the user ip
> section of the microtronix forums.

Any link handy for this example?

> the avalon master is really as simple as the slave.

Almost, you have to hold address, data and read/write active as long
as waitrequest is pending. I don't like this, see above.

In my case e.g. the address from JOP (= top of stack) is valid only
for a single cycle. To avoid one more cycle latency I present in the
first cycle the TOS and register it. For additional wait cycles a
MUX switches from TOS to the address register. I know this is a
slight violation of the Avalon specification. There can be some
glitches on the MUX switch. For synchronous on-chip peripherals this
is absolutely no issue. However, these signals are also used for
off-chip asynchronous peripherals (SRAM). Still, I assume that
these possible switching glitches are not really seen on the output
pins (or at the SRAM input).

Martin


\end{verbatim}
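
The address handling described in the last message (present the
single-cycle top-of-stack value directly in the first cycle, register
it at the same time, and let a multiplexer switch to the register for
any further wait cycles) could be sketched roughly as follows. This is
only an illustration under assumptions and not the actual bridge code:
the entity name \sign{addr\_hold}, the port names \sign{tos} and
\sign{av\_address}, and the 32-bit width are all hypothetical.

\begin{verbatim}
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical sketch: all names and the 32-bit width are assumptions.
entity addr_hold is
  port (
    clk        : in  std_logic;
    rd, wr     : in  std_logic;  -- single-cycle commands from the master
    tos        : in  std_logic_vector(31 downto 0);  -- valid for one cycle
    av_address : out std_logic_vector(31 downto 0)   -- held for the slave
  );
end addr_hold;

architecture rtl of addr_hold is
  signal addr_reg : std_logic_vector(31 downto 0);
begin

  -- Capture the address in the cycle in which the command is issued.
  process(clk)
  begin
    if rising_edge(clk) then
      if rd = '1' or wr = '1' then
        addr_reg <= tos;
      end if;
    end if;
  end process;

  -- First cycle: pass tos straight through (no extra cycle of latency).
  -- Wait cycles: switch to the registered copy.
  av_address <= tos when (rd = '1' or wr = '1') else addr_reg;

end rtl;
\end{verbatim}

The combinational path from \sign{tos} to the output avoids one cycle
of latency. The price is the multiplexer switch between the first cycle
and the wait cycles, which is where the glitches mentioned above can
occur.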


\end{document}
@


1.3
log
@update from JOP
@
text
@@


1.2
log
@no message
@
text
@d446 1
a446 1
\end{document}
d452 1
d526 1
d529 1
d557 319
@


1.1
log
@Add document sources to the project
@
text
@d429 2
a436 2
    \item Provide more SimpCon examples (e.g.\ a UART)
    \item Change JOPs IO interface to SimpCon
@