Computer Design
A computer aided design and VLSI approach

Paul J. Drongowski

Chapter 11 - Speed, space and power estimates.

Ultimately, the system must provide adequate performance
and satisfy any engineering requirements (unit cost, size,
packaging, power supply, cooling). Procedures for estimating
speed, space and power characteristics are given in this
chapter. We will use a fictitious set of building blocks
which resemble VLSI standard cells to illustrate the
estimation process.

Section 1 - Introducing standard cells.

Standard cells are an important technological choice for the
implementation of a computing system. (Chapter 16 will look
at standard cells in more detail.) A standard cell is
a VLSI building block which implements a common function such as
an adder, register or register file. Design begins in the
usual way -- a block diagram is drawn for the data and control
paths along with the control flow graph. To obtain a completed
design that is ready for fabrication, the standard cells must be
placed on the surface of the chip ("layout") and interconnected
using silicon (or metal) wires. Any signal coming into or leaving
the system must pass through special input or output circuits
which condition the signal for the external environment and
provide bonding pads. The bonding pads will be used to connect
the chip with the DIP leads.

Because circuits are physical devices, they consume space
and power, dissipate heat and have a finite switching
speed. The tricky part of VLSI design is choosing a dense
arrangement of circuits within the confines of a fixed size
chip. Input, output and tristate circuits must be placed
along the borders of the chip to make wire bonding easier.
Power consumption and dissipation must be within acceptable
limits. Switching speed and propagation delays depend heavily
upon wire lengths. Shorter wires have lower capacitance and
shorter propagation delays. Critical path components
should be placed near each other, thereby reducing the length
of the wires (and delays!) along the critical timing path.
Off-chip interconnections are the worst -- capacitances are
50 to 100 times higher than those of on-chip wires. Delay is
proportionately higher or, if we attempt to keep delay constant,
power consumption and heat dissipation are.


Standard cells          Blocks  Supply   Power   Delay
----------------------  ------  -------  ------  -------
8 x 8 register file     4 x 5    7   mA  35  mW  30 nsec
16 x 16 register file   8 x 10   25  mA  125 mW  35 nsec
8-bit register          4 x 1    0.8 mA  3.7 mW  20 nsec
16-bit register         8 x 1    1.5 mA  7.5 mW  20 nsec
8-bit counter           5 x 1    1.0 mA  5   mW  20 nsec
16-bit counter          10 x 1   2.0 mA  10  mW  25 nsec
Control store           12 x 6   50  mA  250 mW  35 nsec
ALU                     8 x 4    20  mA  100 mW  30 nsec
16-bit 2:1 mux          4 x 1    0.8 mA  4   mW  10 nsec
16-bit 4:1 mux          4 x 1    0.4 mA  2   mW  10 nsec
8-bit external input    8 x 2    0   mA  0   mW  0  nsec
16-bit external input   16 x 2   0   mA  0   mW  0  nsec
8-bit external output   8 x 2    20  mA  100 mW  100 nsec
16-bit external output  16 x 2   40  mA  200 mW  100 nsec
8-bit ext. tristate     8 x 2    25  mA  110 mW  100 nsec
16-bit ext. tristate    16 x 2   45  mA  225 mW  100 nsec

       Table 1 - Fictitious standard cell set.

Section 2 - Clock period.

In order to determine overall system performance, we must first
compute the expected clock period. This time will be determined
by the longest delay path. The longest delay path is found by
tracing the flow of information through each of the paths leading
from the source (operand) storage elements through any data
operators or selection blocks to the destination storage elements.
The delay for each path is the sum of:
  * The time required to read an operand value (if any),
  * The switching delays incurred by the data operators and
    selectors, and
  * The set-up and hold times of the destination storage elements.
As we will see in a later chapter, the delay times are perturbed
by the length of the wires interconnecting the blocks -- a significant
source of delay if the wires are long. The longest path in Figure 1
starts at block Source-0 and ends at Destination-1. Assuming a 20 ns
combined set-up and hold time (from Table 1), the path delay and minimum
clock period are 50 + 20 + 20 = 90 ns.

                    -------
                   |       |
      Source ----->| 50 ns |-----------------> Destination
            0      |       |         |                    0
                    -------          |
      Source --                      V
            1  |    -------       -------
                -->|       |     |       |
                   | 25 ns |---->| 20 ns |----> Destination
                -->|       |     |       |                 1
               |    -------       -------
      Source --
            2

                Figure 1 - Longest delay path.
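
The path trace for Figure 1 can be sketched in a few lines of
Python. This is only a sketch under stated assumptions: the block
names (src0, op50, dst1 and so on), the graph encoding and the
function names are ours, sources are assumed to have no read delay,
and each destination is charged the 20 ns set-up and hold time from
Table 1. Wire delays are ignored.

```python
from functools import lru_cache

# Delay contributed by each block in ns (values from Figure 1).
# Sources are assumed to cost nothing to read; each destination
# charges the 20 ns combined set-up and hold time from Table 1.
DELAY_NS = {
    "src0": 0, "src1": 0, "src2": 0,
    "op50": 50, "op25": 25, "op20": 20,
    "dst0": 20, "dst1": 20,
}

# Directed edges: block -> blocks it drives (as drawn in Figure 1).
FANOUT = {
    "src0": ["op50"],
    "src1": ["op25"],
    "src2": ["op25"],
    "op50": ["dst0", "op20"],
    "op25": ["op20"],
    "op20": ["dst1"],
    "dst0": [],
    "dst1": [],
}

@lru_cache(maxsize=None)
def longest_from(block):
    """Return (delay, path) for the slowest path starting at block."""
    best_delay, best_path = 0, ()
    for nxt in FANOUT[block]:
        d, p = longest_from(nxt)
        if d > best_delay:
            best_delay, best_path = d, p
    return DELAY_NS[block] + best_delay, (block,) + best_path

def longest_path(sources):
    """Slowest source-to-destination path; its delay bounds the clock."""
    return max((longest_from(s) for s in sources), key=lambda r: r[0])

delay, path = longest_path(["src0", "src1", "src2"])
print(delay, "ns:", " -> ".join(path))
```

For the graph above the sketch reports the 90 ns path from Source-0
through the 50 ns and 20 ns blocks to Destination-1.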

Programmable and multiphase clocks complicate the analysis a bit.
Instead of estimating a single clock period, the designer must
determine minimum times of the individual clock phases or periods.
The same procedure can be used, but the assignment of data or
storage operations to phases (periods) must be taken into account.

This procedure sums the individual delay times of the components.
Logic circuits are not strictly on-off devices, but possess an
analog character as well. Thus, successive stages of logic will
begin to switch even though the earlier stages have not achieved
their final minimum or maximum voltage values. A simple sum will
yield a conservative, pessimistic estimate. For MOS combinational
logic, the square root of the sum of squares provides a better
approximation to actual delay. For the example in Figure 1, the
combinational delay is sqrt(50^2 + 20^2), or about 54 ns, and the
expected clock period is 54 + 20 = 74 ns using this method.
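
The two estimates can be compared directly. In the sketch below the
function name rss_delay is our own; the delays are those on the
longest path of Figure 1, and the set-up and hold time is added
linearly because it belongs to the storage element rather than to
the combinational logic.

```python
import math

def rss_delay(stage_delays_ns):
    """Root-sum-square combination of combinational stage delays."""
    return math.sqrt(sum(d * d for d in stage_delays_ns))

logic_ns = [50, 20]    # combinational blocks on the longest path (Figure 1)
setup_hold_ns = 20     # destination storage element (Table 1)

simple_sum = sum(logic_ns) + setup_hold_ns
rss_estimate = rss_delay(logic_ns) + setup_hold_ns

print(simple_sum)             # 90
print(round(rss_estimate))    # 74
```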

                   |
                   *
                  * *
                 * | *
                 * | *
                *  |  *
               *   |   *
             *     |     *
           *       |       *
        *          |          *
      -----|---------------|-----
          Min   Typical   Max
    Reject |   Acceptable  | Regrade

    Figure 2 - Distribution of delays.

The integrated circuit fabrication process is not perfect
and real component delay times will obey a statistical
distribution over time (Figure 2). The published specifications
for a part will show three delay values: minimum, typical and
maximum. Delivered parts will operate within the stated minimum
and maximum delay values and the average of the distribution
will be the typical (or "nominal") delay. Devices that operate
faster or slower can be regraded and sold at either a premium or
reduced price (respectively).

If the distribution is symmetric about the mean, then one
would expect half of the parts to run slow and the other half
to run faster than the typical delay value. The actual speed
of an assembled product depends upon the characteristics of
its components. Two alternatives are possible.
  * The designer is optimistic and assumes that fast parts
    will offset the increased delay of the slower parts.
    Optimists use typical delays when computing the clock speed.
  * The designer is pessimistic and performs a worst case
    computation using stated maximum delays.
In the second case, the clock period will always be sufficiently
long to guarantee reliable operation of production units. However,
worst case analysis may force the designer to set artificially
high and costly timing goals for the design.

The optimistic designer, however, has made one bad assumption --
that all components have equivalent delay times which will
compensate for one another. This assumption does not always hold.
For example, the increased delay of a relatively slow ALU is not
necessarily offset by a faster NAND gate. Increased computation
time will reduce storage set-up time below the necessary minimum
and correct, reliable operation will cease. A portion of the
production units will fail for this reason. The percentage of
failed units can be reduced by adding a safety margin (fudge
factor) to the clock period calculation.
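
The effect of a safety margin can be explored with a small
simulation. Everything statistical here is an assumption made for
illustration and does not come from the text: block delays are
taken to be independent and normally distributed about their
typical value, with a standard deviation of 10 percent of typical.
The path is the longest path of Figure 1, with its delays treated
as typical values.

```python
import random

def failure_rate(typical_ns, margin, sigma_frac=0.10, units=20000, seed=1):
    """Fraction of simulated units whose path delay exceeds the clock.

    Assumptions (ours, not the text's): independent, normally
    distributed block delays with standard deviation sigma_frac
    times the typical delay.
    """
    rng = random.Random(seed)
    clock_ns = sum(typical_ns) * (1.0 + margin)
    failures = 0
    for _ in range(units):
        path_ns = sum(rng.gauss(t, sigma_frac * t) for t in typical_ns)
        if path_ns > clock_ns:
            failures += 1
    return failures / units

path = [50, 20, 20]   # Figure 1 longest path, treated as typical delays
for margin in (0.0, 0.05, 0.10, 0.20):
    rate = failure_rate(path, margin)
    print(f"margin {margin:4.0%}: {rate:.1%} of units fail")
```

With no margin roughly half of the simulated units fail, as the
symmetric distribution predicts; the failure rate falls as the
margin grows. The exact percentages depend entirely on the assumed
spread.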

VLSI designers have a slightly different problem. Since all
building blocks reside on the same chip and are manufactured
under the same processing conditions, the speed of all blocks
will either increase or decrease. (We are assuming uniform
processing across the die. This assumption definitely does not
hold for wafer scale systems.) The designer must choose a clock
period that guarantees an economically viable yield -- the percentage
of good parts after fabrication. If the target clock speed is set
too high, then too few devices will be fabricated with acceptable
minimum performance. Low yield usually means low profit as the
expense of fabrication must be spread across the price of the
working chips.

Section 3 - Control graph analysis.

Once the clock speed has been determined, the control graph
can be analyzed. The following procedure is suggested.
  * Label each control event symbol with its expected
    execution time. This time will vary from event to event
    in systems employing a programmable clock.
  * For each instruction or system operation, trace the execution
    flow of the control graph beginning with instruction fetch and
    ending with the last event needed to interpret the instruction.
Conditional branches and loops pose an interesting analytical
problem. One solution is to trace each execution path and build
a table of conditions and execution times. In the case of the
SP.1 branch on zero instruction, the table below:

       Condition   |  Time
     -------------------------
           Z = 0   |  100 ns
           Z = 1   |  200 ns

gives the execution time of the conditional branch instruction
for each value of the Z flag. Loop times may be
parameterized in the number of iterations around the loop.
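
The tracing procedure amounts to summing labeled event times along
each path. The sketch below rebuilds the table above. The event
names and the 50 ns clock period are hypothetical, chosen only so
that the totals agree with the table; they are not the actual SP.1
control graph. The loop_time function shows one way to parameterize
a loop in its iteration count.

```python
CLOCK_NS = 50   # hypothetical clock period, chosen to match the table

# Hypothetical event sequences for a branch on zero instruction.
# Each entry is (event name, execution time in ns).
BRANCH_PATHS = {
    "Z = 0": [("fetch", CLOCK_NS), ("test Z", CLOCK_NS)],
    "Z = 1": [("fetch", CLOCK_NS), ("test Z", CLOCK_NS),
              ("read target", CLOCK_NS), ("load PC", CLOCK_NS)],
}

def path_time(events):
    """Sum the labeled execution times along one traced path."""
    return sum(t for _, t in events)

def loop_time(entry_ns, body_ns, exit_ns, iterations):
    """Execution time of a loop as a function of the iteration count."""
    return entry_ns + iterations * body_ns + exit_ns

timing_table = {cond: path_time(ev) for cond, ev in BRANCH_PATHS.items()}
for cond, t in timing_table.items():
    print(f"{cond}: {t} ns")
```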

Section 4 - Space.

Naturally, physical circuits occupy space. A variety of packaging
and interconnection techniques are available and usually two or
more different methods will be used in the mechanical design of
the product. (Packaging and interconnection will be discussed in
detail in Chapter 17.)
  * At the printed circuit board level, packages (dual in-line,
    pin grid array, leadless chip carrier, etc.) are interconnected
    by metal wires routed on one or more board layers. Board
    area is finite (obviously) and is often determined by the
    standard form and mechanical design employed by a vendor.
  * VLSI chips consist of building blocks that are interconnected by
    semiconductor and metal wires. Chip area is limited by the
    defect density of the fabrication process and the desire for
    acceptable production yields.
Thus, the bottom-up space analysis to be performed must determine
if the data and control circuits can fit onto a board or chip of
the desired size. If the design is too big, it must be broken
across chip or board boundaries.

To estimate size, the designer drafts a scale drawing of the
physical layout of the system. The "floor plan" shows the relative
sizes and placement of the building blocks on each chip or board.
Design experience and maximum power dissipation levels help to
determine the maximum number of blocks that can fit onto the floor
plan. On-chip wiring is a significant concern in VLSI design as
up to 80 percent of chip area may be required for interconnect.
System designers always try to use the most inexpensive cooling
method possible. The fundamental physical limitations on heat
transfer may dictate the use of fewer components or a lower power
circuit technology.

Percentage of active and wire area and the number of circuits or
pins can be used as rough size (and cost) measures.

Section 5 - Power.

Current consumption and power dissipation values are
calculated by summing the individual current and power values
of the components in the system. Total heat dissipation will
indicate if system packaging and cooling are sufficient. If
the expected power dissipation is too high, then either
a more expensive form of cooling or packaging will be required
or a lower power design must be found.

The computed current consumption indicates the size (and cost)
of the power supply for the product. This value is also used to
size the power distribution grid to assure that power bus wires
will have sufficient cross-sectional area to carry power with
a safe current density.

Actual system current consumption and power dissipation
depend upon the operational characteristics of the circuit technology.
In nMOS and bipolar technologies, designers assume that half of the
gates will be switched high and the other half will be switched
low. Thus, actual current and power will be 50 percent of their
worst case values. (The percentage by which a value is derated
is sometimes called the "duty cycle.") Power dissipation in
CMOS circuits depends upon the clock (switching) frequency since
more current flows through a CMOS circuit while it is switching than
while it is quiescent.
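
Both effects are easy to express as arithmetic. In the sketch below
the derate function applies a duty cycle to a worst case value from
Table 1. The CMOS function uses the common first-order model in
which switching power is proportional to C * V^2 * f; that formula,
the 100 pF load and the 5 volt supply are our assumptions for
illustration and are not taken from the text.

```python
def derate(worst_case, duty_cycle):
    """Scale a worst case current or power value by the duty cycle."""
    if not 0.0 <= duty_cycle <= 1.0:
        raise ValueError("duty cycle must lie between 0 and 1")
    return worst_case * duty_cycle

def cmos_dynamic_power_mw(c_pf, vdd_volts, f_mhz, activity=1.0):
    """First-order CMOS switching power, activity * C * V^2 * f, in mW.

    With C in pF and f in MHz the product is in microwatts,
    hence the division by 1000.
    """
    return activity * c_pf * vdd_volts ** 2 * f_mhz / 1000.0

# ALU from Table 1 (20 mA, 100 mW worst case) at a 50 percent duty cycle.
print(derate(20, 0.5), "mA", derate(100, 0.5), "mW")

# Hypothetical CMOS load: 100 pF switched at 5 V.
print(cmos_dynamic_power_mw(100, 5, 10), "mW at 10 MHz")
print(cmos_dynamic_power_mw(100, 5, 20), "mW at 20 MHz")
```

Doubling the switching frequency doubles the estimated CMOS power,
while the nMOS and bipolar estimate is independent of frequency.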

Area, current and power dissipation calculations take the form
of a spreadsheet (Figure 3). The percentage of active area is
the area occupied by the blocks divided by the total area of
the system (chip).

   Quantity | Component       | Area | Current | Dissipation
   ---------------------------------------------------------
       1    | Register file   |  80  |  25 mA  |   125 mW
       1    | ALU             |  32  |  20 mA  |   100 mW
       4    | 2:1 multiplexer |  16  |   3 mA  |    15 mW
   ---------------------------------------------------------
                                128     48 mA      240 mW

      Figure 3 - Area, current and power spreadsheet.
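
The spreadsheet is simple enough to reproduce in a short program.
The sketch below uses the row values exactly as printed in Figure 3.
The chip area of 18 by 18 blocks is borrowed from the design problem
in Section 6 so that a percentage of active area can be shown; the
variable and function names are ours.

```python
# Each row: (quantity, component, area in blocks, current in mA, power in mW).
# Area, current and power are the row totals as printed in Figure 3.
ROWS = [
    (1, "Register file",   80, 25, 125),
    (1, "ALU",             32, 20, 100),
    (4, "2:1 multiplexer", 16,  3,  15),
]

CHIP_AREA = 18 * 18   # blocks; chip size from the Section 6 design problem

def totals(rows):
    """Column sums for area, current and power."""
    area = sum(r[2] for r in rows)
    current_ma = sum(r[3] for r in rows)
    power_mw = sum(r[4] for r in rows)
    return area, current_ma, power_mw

area, current_ma, power_mw = totals(ROWS)
active_pct = 100.0 * area / CHIP_AREA

print(f"{area} blocks, {current_ma} mA, {power_mw} mW")
print(f"active area: {active_pct:.1f} percent of {CHIP_AREA} blocks")
```

The totals agree with Figure 3, and the three components would
occupy about 39.5 percent of an 18 by 18 block chip.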

Section 6 - Design problem ???.

For this problem, you must design the data and control
paths for the SP.1 computer using the set of standard cells
given in Table 1. Your design documentation should include
the following items.
  * A block diagram for the system showing the cells and their
    interconnections.
  * A control graph which will become the microcode (controller
    programming information) for the system.
  * A floor plan which shows the physical placement of the cells
    on one or more chips.
  * Estimates for clock period (microinstruction cycle time), power
    consumption, power dissipation, percentage of active (cell) area,
    and percentage of wire area.
  * An essay describing your design considerations and trade-offs.
The floor plan must be drawn on graph paper. Total chip area
is 18 by 18 blocks and at a minimum, forty percent of chip
area should be allocated for wiring. The maximum amount of
power which the system may dissipate is two watts.
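
These limits can be checked mechanically once the spreadsheet totals
are known. The sketch below is one way to do so; the function name
check_budget is ours. It tests only total cell area and total
power. It does not verify that the cells can actually be placed on
the floor plan, which still requires the drawing.

```python
CHIP_SIDE = 18              # blocks per side (from the problem statement)
MIN_WIRING_FRACTION = 0.40  # at least forty percent of the chip for wiring
MAX_POWER_MW = 2000         # two watts

def check_budget(cell_area_blocks, power_mw):
    """Return a list of violated limits (empty if the design fits)."""
    chip_area = CHIP_SIDE * CHIP_SIDE
    max_cell_area = chip_area * (1.0 - MIN_WIRING_FRACTION)
    problems = []
    if cell_area_blocks > max_cell_area:
        problems.append(
            f"cell area {cell_area_blocks} exceeds {max_cell_area:.1f} blocks")
    if power_mw > MAX_POWER_MW:
        problems.append(f"power {power_mw} mW exceeds {MAX_POWER_MW} mW")
    return problems

print(check_budget(128, 240))    # the Figure 3 totals fit: []
print(check_budget(200, 2100))   # violates both limits
```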

A few notes on Table 1 are in order. Cell sizes are given
in "blocks." When laying out the system, you should assume that
data flows into one side of a cell and out the opposite side and
that parallel data will arrive and leave a cell along its broadest
sides. Current consumption and power dissipation values are maximums
and may be scaled by the expected duty cycle. Delay times are also
worst case. The delay time for a storage component is the combined
set-up and hold time of the standard cell.

Copyright (c) 1987-2013 Paul J. Drongowski