Computer Design: A computer aided design and VLSI approach
Paul J. Drongowski

Chapter 10 - Style.

Chapters 8 and 9 on basic datapath and control design give us the tools to examine stylistic issues in processor design. A designer does not look upon the system as a "random" collection of components and events. Rather, she or he organizes the system structure in accord with a consistent theme that enhances the comprehensibility of the design. Humans naturally try to produce order from disorder, and that is the function of style in computer design. This chapter presents three kinds of stylistic choices in datapath design and three in control design. The key trade-offs in adopting a particular style are:

* The number of datapath operations which can be performed in parallel,

* The number of control events needed to execute an aggregate system operation,

* The number of components needed in the datapath or control design (with direct effect on current, space and cooling requirements), and

* Interconnection complexity.

Of course, one does not have to follow a style rigidly. If execution speed can be increased or component count reduced by choosing a "hybrid" of two or more styles or by specializing the design to exploit a feature of the architecture, then by all means the designer should take advantage of the situation!

Section 1 - Busses.

The naive synthesis procedure given in Chapter 9 will usually yield a "chaotic" datapath since its structure is driven solely by the data dependencies in the ISA specification. The advantage of this scheme, however, is that the datapath can be highly specialized for the ISA. We call this the "zero bus" style.

The zero bus style is compatible with an event scheduling strategy that minimizes the number of control steps in order to increase execution speed. All transfers that can be performed in parallel (subject to flow dependencies between the computation and use of a variable) are set to execute in one control event. Flow dependencies, where the value of a variable must be computed and stored in one event in order to be used in a subsequent event, govern the assignment of transfers across sequential events. The zero bus style will result in the fastest possible schedule (the shortest sequence of control steps), but it is expensive in component count and interconnection complexity. All expressions must be fully evaluated in one control step, so little component sharing is possible. A component can serve two or more uses only if each use can be scheduled independently across two or more control steps.
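To make this scheduling strategy concrete, here is a minimal sketch of "as soon as possible" event scheduling in Python. The transfer list and variable names are hypothetical, and the sketch deliberately ignores resource limits -- the zero bus assumption that every transfer has its own hardware.

    # Minimal sketch of "as soon as possible" event scheduling.  Each
    # transfer names the variable it writes and the variables it reads;
    # a transfer may not execute before the event following the one that
    # produces each of its inputs.  Resource limits are ignored, as in
    # the zero bus style, and all transfer names are hypothetical.

    def schedule(transfers):
        """transfers: list of (destination, sources) pairs in program
        order.  Returns a list of control events, each a list of the
        destinations written in that event."""
        ready = {}     # control event in which each value becomes usable
        events = []    # events[i] = destinations written in event i
        for dest, sources in transfers:
            earliest = max((ready.get(s, 0) for s in sources), default=0)
            while len(events) <= earliest:
                events.append([])
            events[earliest].append(dest)
            ready[dest] = earliest + 1   # usable in the following event
        return events

    # T1 <- A + B and T2 <- C + D are independent and share event 0;
    # Z <- T1 + T2 must wait for both partial sums.
    print(schedule([("T1", ["A", "B"]),
                    ("T2", ["C", "D"]),
                    ("Z",  ["T1", "T2"])]))
    # [['T1', 'T2'], ['Z']]

The two independent partial sums share control event 0, while the final transfer waits one event for its operands. A production scheduler would also honor resource constraints and anti-dependencies.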
It may be necessary to share expensive (large or power-hungry) resources such as a multiplier to meet spatial, current or cooling constraints. In that case, the common resource must be "multiplexed" in time as only one operation may be active at any time. The operations using the resource, therefore, must be serialized into sequential control events. This is a classic space-time trade-off.

To implement the system as a VLSI circuit or printed circuit board, the components must be positioned in a two-dimensional (2-D) plane and the wires must be routed according to the schematic. Clearly, the number of wires and distinct point-to-point connections will affect the amount of board or chip area needed for interconnect. (On-chip wiring, for example, may require as much as 80 percent of total chip area.) Further, wire routing in VLSI circuits is expensive. If done manually, routing is a time-consuming and error-prone task. Although routing programs are faster and more accurate, they still consume large amounts of memory and computer cycles at a very real cost to the manufacturer. An automatic router may be able to route only 95 to 99 percent of the system, leaving the remaining 1 to 5 percent for manual layout.

To reduce routing complexity, designers often introduce shared "busses" into their designs. A bus also makes the system expandable if it is brought out to a readily accessible connector. Figure 1 depicts a typical two bus datapath.

     Write                     Read
       |         -----           |
       |------->|  X  |--------->|
       |         -----           |
       |         -----           |
       |------->|  Y  |--------->|
       |         -----           |
       |         -----           |
       |------->|  Z  |--------->|
       |         -----           |
       |-----                    |
       |     |                   |
       |     V                   |
       |   -----                 |
       |  |  T  |                |
       |   -----                 |
       |     |                   |
       |     V                   V
       |   -----------------------
       |   \          ALU        /
       |    ---------------------
       |              |
       ---<------------

     Figure 1 - Example two bus datapath.

The bus on the right is used for reading operands from the storage elements X, Y and Z. The read bus is routed to the right hand input of the arithmetic logic unit. ALU results are written to X, Y and Z from the bus on the left. Note that the temporary register T is needed to hold the left hand operand to the ALU. Two control steps will be required to perform the addition "Z <- X + Y" as shown in Figure 2.

        -------------
       | Move X to T |
       |-------------|
       |   T <- X    |
        -------------
              |
              V
        -------------
       | Add & store |
       |-------------|
       | Z <- T + Y  |
        -------------

     Figure 2 - Two bus sum "Z <- X + Y."

Like all shared busses, the read and write busses may each have only one transmitting component at any time to prevent data collisions. Bi-directional buffers, switches or multiplexers may be used to implement the read bus. All data transfers must be serialized across the read bus. This disadvantage of the two bus approach was noted above -- dyadic data operations require two control steps.

The three bus approach alleviates this problem at the cost of a third bus. An operand may be placed onto either read bus and sent to either input of the ALU (Figure 3.) If dyadic operations are used infrequently, then the cost of the third bus may be too high and the two bus approach would be preferred. Dyadic operations (add and multiply) are critical to signal processing applications, however, and there the cost of the third bus may be justified solely on the need for speed.

     Write                  Read  Read
       |         -----        |     |
       |------->|  X  |------>|---->|
       |         -----        |     |
       |         -----        |     |
       |------->|  Y  |------>|---->|
       |         -----        |     |
       |         -----        |     |
       |------->|  Z  |------>|---->|
       |         -----        |     |
       |                      |     |
       |                      V     V
       |                 ------------
       |                 \   ALU    /
       |                  ----------
       |                      |
       ---<--------------------

     Figure 3 - Example three bus datapath.

J-K master-slave flip-flops will be required for registers that must be read and written in the same clock cycle (e.g., to perform transfers such as "X <- X + Y.") Why? As we will see later, the same capability can be gained through the proper use of a two-phase non-overlapping clock.
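The serialization imposed by the shared read bus can be seen in a small behavioral model. The Python sketch below is illustrative only -- the register contents are invented and the ALU is reduced to a Python function -- but it mimics Figure 1: each control step moves one value over the read bus, through the ALU, and onto the write bus.

    # Behavioral sketch of the two bus datapath of Figure 1.  One
    # register drives the shared read bus per control step; the ALU
    # result drives the write bus, from which one register is loaded.

    regs = {"X": 5, "Y": 7, "Z": 0, "T": 0}

    def control_step(read, alu_op, write):
        """One control event: 'read' drives the read bus, the ALU
        combines T (left input) with the bus (right input), and the
        result is written from the write bus into register 'write'."""
        bus = regs[read]                    # single driver on the read bus
        regs[write] = alu_op(regs["T"], bus)

    # Z <- X + Y takes two control events, as in Figure 2:
    control_step("X", lambda t, b: b, "T")      # event 1: T <- X (pass B)
    control_step("Y", lambda t, b: t + b, "Z")  # event 2: Z <- T + Y
    print(regs["Z"])                            # prints 12

In the three bus datapath of Figure 3 the staging step through T disappears, which is precisely the speed argument made above.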
Section 2 - Storage.

The naive register assignment procedure in Chapter 9 calls for a separate register or register file for each ISA variable. The designer could, however, assign all of the ISA variables to a single register file. This extreme assignment scheme would reduce the complexity of interconnection and, in the case of a two bus datapath, would exploit the addressing mechanism of the register file to select read bus operands. All transfers in or out of the register file will be serialized, however.

The opportunity to assign several variables to a single file usually arises when RAM locations are left over after the ISA general registers have been allocated to the register file. Thus, the designer is getting some registers "for free" and wants to take advantage of the situation. The decision whether or not to assign a given variable to the register file should be based upon the frequency of use of the variable. Those variables which see only occasional use should definitely be allocated to the register file. High traffic variables such as those needed for memory address computation or instruction fetch and decode are best allocated to a separate register block. The use of a register file or RAM also precludes the exploitation of intrinsic operations.

      Address   Address
         A         B
         |         |
         V         V
        ------------
       |            |----> Output A
  ---->|            |
       |            |----> Output B
        ------------
              |
             W/R

     Figure 4 - Dual port RAM.

The disadvantages of a standard, one word at a time register file can be removed if a dual port RAM is used instead. A dual port RAM permits two memory words to be read simultaneously. Two independent address and data output ports are required. A dual port RAM can be used to good advantage in a register to register architecture. Simultaneous access to two general registers means that the interpretation of most register to register instructions will require at most one control step.

When a dual port RAM is not available as an off the shelf building block, its operation can be simulated using two standard RAM's (Figure 5.) A data item is written to each RAM in tandem, keeping the contents of the two RAM's consistent.

                Address A
                    |
                    V
                  -------
    Write ------>|  RAM  |----> Output A
    data    |     -------
            |   Address B
            |       |
            |       V
            |     -------
            ---->|  RAM  |----> Output B
                  -------

     Figure 5 - Simulated dual port RAM.
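The tandem write scheme is straightforward to express as a behavioral model. The Python class below is a sketch only (the memory depth and register numbers are invented); it keeps two ordinary single port memories consistent by writing both on every store, so the two read ports can be addressed independently.

    # Behavioral model of the simulated dual port RAM of Figure 5: two
    # ordinary single port RAM's hold identical contents.  Every write
    # goes to both copies, so port A and port B can read different
    # words at the same time.

    class SimulatedDualPortRAM:
        def __init__(self, words):
            self.ram_a = [0] * words    # copy read through port A
            self.ram_b = [0] * words    # copy read through port B

        def write(self, address, value):
            # Tandem write keeps the two copies consistent.
            self.ram_a[address] = value
            self.ram_b[address] = value

        def read(self, address_a, address_b):
            # Two independent read addresses, as on a true dual port RAM.
            return self.ram_a[address_a], self.ram_b[address_b]

    # One control step of a register to register add, "R3 <- R1 + R2":
    regs = SimulatedDualPortRAM(16)
    regs.write(1, 10)                 # R1 <- 10
    regs.write(2, 32)                 # R2 <- 32
    a, b = regs.read(1, 2)            # fetch both operands at once
    regs.write(3, a + b)              # R3 <- 42

The cost of the scheme is the duplicated storage; the benefit is the single-step operand fetch described above.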
Section 3 - Operators.

Component count is reduced by replacing two or more single function data operators by a more "universal" component that covers all of the needed functions. The aforementioned restrictions on datapath parallelism will apply, as functions that were previously performed in parallel are serialized through the new common component. Intrinsic operations may be exploited at the cost of increased control complexity and marginally higher space and power.

Up to this point we have freely traded components and have assumed that component delay times are equal. Data operators have a much wider range of performance. Delay times are given in Table 1 for simple arithmetic and logical operations for both specialized and ALU implementations (on sixteen bit words.) Simple logical functions take longer through the ALU, but the carry lookahead logic of the ALU speeds addition considerably. The ALU dissipates more power in all cases. Depending upon the characteristics of the building blocks, therefore, the execution time of each control step may actually be increased by replacing simple data operators with a more general component. An increase or decrease will affect the execution time of all control steps in a synchronous system.

            Special function block  |  74LS181/182
      ---------------------------------------------------
      Not        9.5 ns    32 mW    |   22.0 ns   400 mW
      And       12.0 ns    68 mW    |   22.0 ns
      Or        12.0 ns    80 mW    |   22.0 ns
      Add      165.0 ns   300 mW    |   44.0 ns

     Table 1 - Special vs. general data operators.

Section 4 - Control style.

Two different styles are available for the implementation of the controller. In the "hard-wired" style, a conventional sequential machine is designed using either the state table, delay element or sequence counter method. The "microprogrammed" style is a variant of the state table method. The microprogram (state table) is stored in a memory element from which microinstructions (control steps) are selected for execution. Control implementation will be discussed in a later chapter. However, the comparative advantages and disadvantages can be addressed here.

Microprogramming is a popular approach because once the datapath design is fixed, the machine is programmed to emulate the ISA -- a task which is perceived to be easy. Bugs are easily corrected by changing the microprogram and distributing a new version of the binary microcode in ROM or on floppy disk (for machines with a writable control store.) Depending upon the implementation method used for hard-wired control, corrections will range from easy (state table) to difficult (sequence counter.)

There is a tendency when designing a microprogrammed engine to bloat the size of the microstore. Horizontally microprogrammed engines have a low degree of microinstruction encoding, so microinstructions are long (e.g., greater than 48 bits). Designers are tempted to add new "features" at the cost of an increase in microprogram length. Horizontal microwords plus "featuritis" is a bad mix, as the 2-D control store grows rapidly. Simple machines such as Reduced Instruction Set Computers are a reaction to the excesses of horizontal microprogramming. RISC's rely upon small grain functions that can be executed in a single control step with minimal macroinstruction decoding. Extensive microprogramming is unnecessary. Microprogramming is still a viable implementation technique if designers can exercise self-restraint and limit the ISA and code size.

Section 5 - Pipelining.

As C or Pascal programmers, we are used to thinking about the sequential execution of programs. The discussion on datapath parallelism demonstrates that hardware designers have the option to perform several operations concurrently if sufficient functionally independent resources are available. This kind of concurrency is called "horizontal" or "spatial" concurrency.

If we observe that two operations A and B are always performed in series:

     -------------------------
     | A | B | A | B | A | B | ...
     -------------------------

then the operations A and B can be overlapped in time:

     ---------
     | A | B |
     ---------
         ---------
         | A | B |
         ---------
             ---------
             | A | B |
             ---------
                 ...

thereby shortening the overall execution time. This kind of concurrency is called "pipelined" or "vertical" concurrency. The term "pipeline" derives from a conceptual pipe of processing elements:

      ---       ---
     | A |---->| B |
      ---       ---

in which a result produced by A is processed by B while A is producing the next result. Although A and B form a "two stage" pipeline, the idea can be extended to an N-stage pipeline as long as there are enough independent tasks and processors available.

Instruction lookahead is a pipelining technique which prefetches an instruction while the current instruction completes execution. The execution sequence for three instructions in the two stage pipeline:

      ---------       ----------
     | Fetcher |---->| Executer |
      ---------       ----------

is given in Figure 6.

    -------------------
 1: | Fetch | Execute |
    -------------------
            -------------------
         2: | Fetch | Execute |
            -------------------
                    -------------------
                 3: | Fetch | Execute |
                    -------------------

     Figure 6 - Instruction lookahead.
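The benefit of the overlap is easy to quantify with a back of the envelope model. The Python sketch below assumes equal stage times and an uninterrupted instruction stream -- idealizations that the next paragraphs revisit.

    # Rough timing model for two stage instruction lookahead.  Assumes
    # every fetch and every execute take one cycle and that the stream
    # is never disturbed by branches -- both idealizations.

    def serial_time(n, fetch=1, execute=1):
        # Without lookahead each instruction runs fetch then execute.
        return n * (fetch + execute)

    def pipelined_time(n, fetch=1, execute=1):
        # The first instruction fills the pipe; afterward one
        # instruction completes per stage time (the slower stage).
        stage = max(fetch, execute)
        return fetch + execute + (n - 1) * stage

    for n in (1, 3, 100):
        print(n, serial_time(n), pipelined_time(n))
    # n=3 matches Figure 6: six time slots serial, four overlapped.

For a long instruction stream the two stage pipe approaches a factor of two speed-up (200 cycles serial versus 101 cycles overlapped for 100 instructions).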
The Berkeley RISC machine has a three stage pipeline (Figure 7.) Instructions pass through three phases: fetch and decode, result computation and result store.

    ---------------------------
 1: | Fetch | Compute | Store |
    ---------------------------
            ---------------------------
         2: | Fetch | Compute | Store |
            ---------------------------
                    ---------------------------
                 3: | Fetch | Compute | Store |
                    ---------------------------

     Figure 7 - Berkeley RISC pipeline.

Pipelining assumes that the stages will operate independently and that the stream of data will be uninterrupted. Unfortunately, this assumption is not true in practice. Branch instructions and interrupts disturb the regular flow of instructions through the pipeline, and results produced by an instruction are often needed by the very next instruction.

The harshest solution to the interdependence problems above is to empty or "flush" the pipeline and restart execution. If the effort to set up the pipeline is substantial, as in the case of a long pipeline, this solution is expensive. One can also anticipate pathological cases where the pipeline is longer than a short loop.

One approach to the problem of branching and interrupts is the "delayed branch." In this scheme, the normal flow of the pipeline is not immediately disturbed. The prefetched instruction is executed while an instruction is fetched from the (new) target location. Thus, the effect of the interrupt or branch is delayed for one or more execution cycles. Berkeley employed the delayed branch technique in their RISC.

In any parallel system, results must be produced before they are consumed. A close examination of Figure 7 will reveal a potential problem between the store and compute stages of the Berkeley pipeline. During the compute phase, operand values are read from the general register set (which was implemented with a dual port array.) If the store and compute phases are overlapped, however, a needed result may not have been written into the general registers before it is read during the compute phase of the next instruction. The Berkeley team solved this problem by keeping the phase two result in a separate data buffer and by detecting the interdependence between the two instructions. Rather than reading the wrong value from the register array, the operand is retrieved from the buffer instead.

Section 6 - Clocking.

One major drawback of synchronous design with a single phase clock is that the minimum clock period is determined by the speed of the slowest delay path through the design. Results must be computed fast enough to guarantee the minimum set-up time for reliable data storage. A programmable clock alleviates this problem by giving the designer a choice of clock periods. While constructing the microcode (or hard-wired controller), the engineer can select the minimum clock period needed to perform the desired data operations.

In order to handle the specialized data operators in the first column of Table 1, the designer might use a clock with two selectable periods of 45 and 195 nanoseconds (assuming a combined 30 nanosecond set-up and hold time.) The two event sequence of Figure 8 requires 390 and 240 nanoseconds to execute under fixed and programmable speed clocks, respectively.

                          Fixed    Programmable
        --------------
       |  Short op    |   195 ns      45 ns
       |--------------|
       |  Z <- X & Y  |
        --------------
              |
              V
        --------------
       |  Long op     |   195 ns     195 ns
       |--------------|
       |  Z <- X + Y  |
        --------------

     Figure 8 - Graph under fixed and programmable clocks.
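The arithmetic behind Figure 8 can be checked with a few lines of Python. The period menu and the rule for rounding a delay up to the next available period are assumptions consistent with the text; the operator delays come from the first column of Table 1.

    # Arithmetic behind Figure 8.  Each control step needs its operator
    # delay (special function blocks, Table 1) plus a combined 30 ns
    # set-up and hold time.  The period menu and rounding rule below
    # are assumptions consistent with the text.

    SETUP_HOLD = 30                       # ns
    PERIODS = (45, 195)                   # selectable clock periods, ns

    def period_for(delay):
        # Shortest available period covering delay plus set-up/hold.
        return min(p for p in PERIODS if p >= delay + SETUP_HOLD)

    steps = [12.0, 165.0]                 # And (short op), Add (long op)
    chosen = [period_for(d) for d in steps]

    fixed = len(steps) * max(PERIODS)     # fixed clock always runs at 195
    programmable = sum(chosen)            # 45 + 195
    print(fixed, programmable)            # 390 240

Every control step has some overhead associated with its execution. A non-pipelined microprogrammed processor, for example, must fetch a microinstruction before the "real" work is begun. The fetch time is overhead.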
By breaking the clock period into phases and assigning different operations to each phase, we can accomplish more work per control step. Figure 9 contains an example of a three phase clock. (We will discuss the idea of a two-phase non-overlapping clock in a later chapter. That idea is similar, but has some subtle differences.) The clock period is broken into three phases and operations are assigned to each phase. Pulses are generated at different times during the clock period: operations that complete in a short time are controlled by Phase 1, while the slowest are enabled by Phase 3.

                  ____
     Phase 1 ____|    |_____________________________

                           ____
     Phase 2 _____________|    |____________________

                                    ____
     Phase 3 ______________________|    |___________

     Figure 9 - 3-phase clock example.
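As a final sketch, the Python fragment below illustrates one plausible phase assignment rule; the phase times and operation delays are invented for the example. Each operation is clocked by the earliest phase whose pulse arrives after the operation's result has settled.

    # Assigning operations to the phases of Figure 9.  Each phase pulse
    # occurs at a fixed time within the control step; an operation is
    # clocked by the earliest pulse that arrives after its result has
    # settled.  The phase times and delays below are invented.

    PHASE_TIME = {1: 60, 2: 130, 3: 200}   # ns from the start of the step

    def phase_for(delay):
        """Earliest phase whose pulse follows the operation settling."""
        return min(p for p, t in PHASE_TIME.items() if t >= delay)

    for op, delay in [("register move", 20),
                      ("logical OR", 90),
                      ("addition", 180)]:
        print(op, "-> phase", phase_for(delay))
    # register move -> phase 1, logical OR -> phase 2, addition -> phase 3

Copyright (c) 1987-2013 Paul J. Drongowski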