Computer Design
A computer aided design and VLSI approach

Paul J. Drongowski

Chapter 7 - Building blocks.

As explained in Chapter 6, good engineering design involves
both top-down and bottom-up design. While top-down design permits
an appropriate and comprehensible decomposition of system
function, bottom-up design is needed to accommodate the real
engineering concerns of the implementation.

This chapter describes a set of building blocks for digital
system design. We will briefly describe each block and give
control information. The physical characteristics of the blocks
are summarized in the final section and may be used in the
organization level design of the SP.1.

Section 1 - Register storage.

Registers provide small amounts of relatively fast store.
A "simple" register stores one word of data. Data is read
and written in parallel using the data outputs (Y) and
inputs (X), respectively. Normally, the outputs are always
driven and the data stored in the register is available without
an explicit "read" operation. One or more control lines must be
provided for the "load" or "write" operation, however.
In the case of standard 74LS-series logic, the enable signal is
usually a gated clock whose leading (or trailing) edge evokes
the load operation. Figure 2 shows how a clock edge may be gated
(conditionally controlled) using an external AND gate. This
function may instead be included in the operation of the register,
in which case the "write enable" and clock signal(s) must be
distributed to the register instead of a gated clock.

            ---
           |   |
    X ---->|   |----> Y
           |   |
            ---
             ^
             |
          Enabling
          signals

    Figure 1 - Simple register.

                      ---
     Write enable ---|   |
                     | * |----> Gated edge
     Clock ----------|   |
                      ---
                ____      ____
     Clock ____|    |____|    |____

                        _________
     Write ____________|         |_

                          ____
     Gated ______________|    |____
      Edge

           Figure 2 - Gated clock.

Optionally, a simple register may include one or both of the
intrinsic operations "preset" and "clear." All bits in the
register will become one when preset is evoked. The register
will be set to zero when clear is performed.
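
The behavior of a simple register can be sketched in a few lines
of Python. This is only a behavioral illustration, not a circuit
description. The class name SimpleRegister and its method names
are assumptions made for this example. The load is taken on the
leading edge of the gated clock of Figure 2.

```python
class SimpleRegister:
    """Behavioral model of an N-bit simple register (hypothetical names)."""

    def __init__(self, width):
        self.mask = (1 << width) - 1   # N one bits
        self.y = 0                     # outputs are always driven
        self._last_gated = 0           # previous level of the gated clock

    def tick(self, clock, write_enable, x):
        """Apply one sample of the clock and write enable signals.

        The load occurs on the leading (rising) edge of the gated
        clock, that is, of (write_enable AND clock).
        """
        gated = clock & write_enable
        if gated == 1 and self._last_gated == 0:
            self.y = x & self.mask
        self._last_gated = gated

    def clear(self):
        self.y = 0                     # all bits become zero

    def preset(self):
        self.y = self.mask             # all bits become one


r = SimpleRegister(8)
r.tick(clock=1, write_enable=0, x=0x5A)   # clock pulse, but write is disabled
held = r.y                                # register still holds its old value
r.tick(clock=0, write_enable=1, x=0x5A)   # write enabled, clock still low
r.tick(clock=1, write_enable=1, x=0x5A)   # gated edge: X is loaded
loaded = r.y
r.preset()
```

Note that no "read" method is needed. The output Y is simply
available, as described above.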

A simple register of word length N can be constructed by
interconnecting N one bit storage elements (or flip-flops.)
The N elements share the same clock, enable, clear and preset
control signals. Thus, each of the elements will behave
in concert with the others. The process of building multi-bit
devices from a single bit design is called "replication."
Replication is a powerful mechanism since a single design
can be used many times although it is drawn and tested only
once. VLSI systems in particular make extensive use of
replication because custom circuit layout and verification
is an especially time-consuming and expensive task. If a
design is largely composed of a few replicated elements, it
is said to be "regular."

Simple registers appear frequently in digital design. They
may be used to hold intermediate results, general register
values, interface data, instructions and processor conditions.

A "counter" is similar to a simple register. However, it can
also perform one or both of the intrinsic operations "increment"
and "decrement." (Increment and decrement are sometimes called
"count up" and "count down.") Increment and decrement add and
subtract one from the register, respectively, when the clock is
asserted. The count value is initialized using the "load" and
"clear" operations. In Figure 3, separate control lines are
drawn for the four intrinsic operations and the clock.

             -- Clear
            | --- Load
            || ---- Increment
            ||| ----- Decrement
            ||||
            VVVV
            ----
           |    |
    X ---->|    |----> Y
           |    |
            ----
             ^
             |
           Clock

    Figure 3 - Counter.

A counting register may be used as a program counter (instruction
pointer), auto increment or decrement address register, word
count register for block transfers or push-down stack pointer.
Any situation where a value must be increased or decreased by
one is a potential application.
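
A counter can be modeled the same way. The sketch below assumes
that at most one of the four control lines is asserted on any
clock and, when several are asserted, that clear takes priority
over load, increment and decrement. That priority order is an
assumption of the example. A real part defines its own rules.

```python
class Counter:
    """Behavioral model of an N-bit counter (hypothetical names)."""

    def __init__(self, width):
        self.mask = (1 << width) - 1
        self.y = 0

    def clock(self, x=0, clear=0, load=0, increment=0, decrement=0):
        """Perform the selected intrinsic operation on one clock."""
        if clear:
            self.y = 0
        elif load:
            self.y = x & self.mask
        elif increment:
            self.y = (self.y + 1) & self.mask   # wraps around at 2**N
        elif decrement:
            self.y = (self.y - 1) & self.mask   # wraps around below zero
        return self.y


pc = Counter(16)                  # e.g., a 16-bit program counter
pc.clock(load=1, x=0xFFFE)
first = pc.clock(increment=1)     # 0xFFFF
second = pc.clock(increment=1)    # wraps to zero
```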

A "register file" is shown in Figure 4. A register file is a
one-dimensional array of binary words. An address or index
input must be provided to explicitly select a word for writing
or reading. The state of the read/write select input will specify
which of these operations should be performed.

             ---- Read/write select
            |   --- Address (index)
            |  |
            V  V
            ----
           |   |
    X ---->|   |----> Y
           |   |
            ---
             ^
             |
           Clock

    Figure 4 - Register file.

One of the most common uses for a register file is to
implement the general register set for a machine ISA. If any
spare words are available within the file beyond those needed
for the general register set, the spares can be used as
temporary locations or as any other register in the ISA.
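
A register file behaves like a small array indexed by the address
input. The following sketch is illustrative only. The class name,
the single "write" argument standing for the read/write select,
and the use of word 8 as a temporary are assumptions of the
example.

```python
class RegisterFile:
    """Behavioral model of a register file (hypothetical names)."""

    def __init__(self, words, width):
        self.mask = (1 << width) - 1
        self.cells = [0] * words

    def clock(self, address, write, x=0):
        """write=1 stores X at the addressed word.

        The addressed word is returned as the output Y.
        """
        if write:
            self.cells[address] = x & self.mask
        return self.cells[address]


regs = RegisterFile(words=16, width=16)
regs.clock(address=3, write=1, x=0x1234)   # general register 3
regs.clock(address=8, write=1, x=0xBEEF)   # a spare word used as a temporary
value = regs.clock(address=3, write=0)
```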

The shift register is a simple register with two additional
intrinsic operations, "shift left" and "shift right." When the
clock is asserted, each bit will move one position to the
right or left. Inputs and outputs are provided at either end
of the shift register to handle data entering and leaving
the register as the result of a shift. If an arithmetic
shift or rotation is required, the inputs must be set to the
appropriate "boundary" values. In the case of a right arithmetic
shift, for example, the sign bit must be reproduced at the left
end of the word.


                    -- Clear
                   | --- Load
                   || ---- Shift right
                   ||| ----- Shift left
                   ||||
                   VVVV
                   ----
    Right in ---->|    |<---- Left in
           X ---->|    |----> Y
    Left out <----|    |----> Right out
                   ----
                    ^
                    |
                  Clock

    Figure 5 - Shift register.

Arithmetic, logical and rotation shifts can be implemented using
a shift register. The shifts can be performed "in place" without
separately reading, rearranging and writing the contents. A shift
register may also be used when multiplication or division by a
power of two is required.
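
The role of the boundary values can be made concrete with a short
sketch. The function names are assumptions of the example. Each
shift takes the bit entering the register and returns the new
contents together with the bit leaving the register.

```python
def shift_right(value, width, bit_in):
    """Shift one position right; bit_in enters at the left (high) end."""
    bit_out = value & 1
    return (value >> 1) | (bit_in << (width - 1)), bit_out


def shift_left(value, width, bit_in):
    """Shift one position left; bit_in enters at the right (low) end."""
    mask = (1 << width) - 1
    bit_out = (value >> (width - 1)) & 1
    return ((value << 1) & mask) | bit_in, bit_out


def arithmetic_shift_right(value, width):
    """Reproduce the sign bit at the left end of the word."""
    sign = (value >> (width - 1)) & 1
    return shift_right(value, width, sign)[0]


def rotate_left(value, width):
    """Wrap the outgoing high bit around to the low end."""
    msb = (value >> (width - 1)) & 1
    return shift_left(value, width, msb)[0]
```

For an 8-bit word, arithmetic_shift_right(0xF0, 8) yields 0xF8,
that is, -16 becomes -8, while shift_left(0x03, 8, 0) doubles 3
to 6.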

Section 2 - Data operators.

Data operators combine and transform digital information as it
passes to and from storage. Data operators are "functional" in
the mathematical sense. The output of a data operator is not
history sensitive and depends solely upon the value of the inputs.
Data operators are usually implemented using combinational logic.

            --- 
           |   |----> Z0
   X0 ---->|   |----> Z1
   X1 ---->|2:4|----> Z2
           |   |----> Z3
            ---

              Figure 6 - Two decoders.

The simplest data operator is the decoder. The decoder has N
inputs and 2^N outputs. For each value applied to the N inputs,
exactly one output will be asserted. Two decoders (two to four and
three to eight)
are shown in Figure 6. Decoders come in two forms: high true
and low true. In the high true form, only one of the outputs
will be high while the rest are low. The opposite situation
holds for low true decoders, i.e., one line is asserted low
while the others remain high.

A decoder can be used to decode control information. For
example, the operation code field of an instruction can be
decoded into individual control lines, one line for each
instruction operation. When reading or writing a register
file or memory element, one word must be selected for the
operation. It is a decoder that maps the address (index) to
the control line that will select the appropriate word.
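
A decoder is easily expressed as a function. The sketch below
returns the 2^N output lines as a list and covers both the high
true and low true forms. The function name and the "high_true"
argument are assumptions of the example.

```python
def decode(value, n, high_true=True):
    """Return the 2**n output lines of an n-input decoder as 0/1 values."""
    outputs = [1 if i == value else 0 for i in range(2 ** n)]
    if not high_true:
        outputs = [1 - bit for bit in outputs]   # one line low, the rest high
    return outputs


lines = decode(2, 2)                      # two to four decoder, input = 2
low_lines = decode(2, 2, high_true=False)
```

Here "lines" is [0, 0, 1, 0] and "low_lines" is [1, 1, 0, 1].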

       Figure 6 - Full adder.

       Figure 7 - Ripple carry adder.

The interpretation of an ISA requires the evaluation of
many arithmetic and logical operations. The most common
of these operations are:

   R = A + B + Cin      Addition with carry-in
   R = A - B - Bin      Subtraction with borrow
   R = A & B            Bit-wise logical AND
   R = A | B            Bit-wise logical OR
   R = A @ B            Bit-wise exclusive-OR
   R = ~ A              One's complement (logical NOT)
   R = ~ A + 1          Two's complement (negative)

Addition appears so frequently that an adder is regarded
as a basic building block. (Early machines exploited the
internal logic of the adder to form other operations as well.)
Subtraction is easily obtained from addition by taking
the one's complement of the right hand argument and forcing
the carry-in of the adder to one. Increment by one and decrement
by one may be constructed in a similar manner.
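
The construction can be checked with a small sketch. The function
names are assumptions of the example, and the decrement shown
(adding a word of all ones with a carry-in of zero) is one
possible construction rather than the only one.

```python
def add(a, b, carry_in, width):
    """N-bit adder: returns (sum, carry_out)."""
    mask = (1 << width) - 1
    total = (a & mask) + (b & mask) + carry_in
    return total & mask, total >> width


def subtract(a, b, width):
    """A - B as A + (one's complement of B) with the carry-in forced to one."""
    mask = (1 << width) - 1
    return add(a, ~b & mask, 1, width)


def increment(a, width):
    """A + 1: add zero with the carry-in forced to one."""
    return add(a, 0, 1, width)


def decrement(a, width):
    """A - 1: add all ones (the two's complement of one), carry-in zero."""
    mask = (1 << width) - 1
    return add(a, mask, 0, width)
```

With 8-bit words, subtract(9, 5, 8) gives a result of 4, and
subtract(5, 9, 8) gives 252, which is -4 in two's complement.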

        Figure 8 - Lookahead adder.

The design of addition circuits has been extensively studied.
The two most widely recognized implementations are the
"ripple carry" and "carry lookahead" adders. An N-bit ripple carry
adder consists of N full adder circuits where each full adder
produces a one bit sum with carry-out. By interconnecting the
carry-out of each stage to the carry-in of the next higher stage,
a chain of full adders is formed. Carries will propagate through
the chain one stage at a time. Thus, the worst case delay will
depend upon the length of the carry path. The lookahead
adder speeds the addition process by generating and propagating
two special signals (G and P in Figure 8) across a group of
individual adders, thereby skipping some stages of the carry
chain. Total delay time will depend upon the group size and
the number of groups in the generate and propagate chain.
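
Both schemes can be sketched at the bit level. The generate and
propagate definitions used here (g = a AND b, p = a OR b for
each bit) are the usual textbook ones and are an assumption
about the signals labeled G and P in the figure. The function
names are also assumptions of the example.

```python
def full_adder(a, b, cin):
    """One bit sum with carry-out."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout


def ripple_carry_add(a, b, cin, width):
    """Chain `width` full adders; the carry moves one stage at a time."""
    result = 0
    carry = cin
    for i in range(width):
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        result |= s << i
    return result, carry


def group_generate_propagate(a, b, width):
    """Group G and P signals for a group of `width` adder stages."""
    g_group, p_group = 0, 1
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        g, p = ai & bi, ai | bi
        g_group = g | (p & g_group)   # a carry is generated in the group
        p_group = p & p_group         # a carry-in would pass through
    return g_group, p_group


total, carry = ripple_carry_add(0b1011, 0b0110, 0, 4)
g, p = group_generate_propagate(0b1011, 0b0110, 4)
lookahead_carry = g | (p & 0)         # carry-out without walking the chain
```

The group carry-out is G OR (P AND carry-in), so a following
group need not wait for the carry to ripple through every stage
of this one.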

Texas Instruments introduced the first monolithic arithmetic
logic unit (ALU) in 1970. The 74181 and its successors use
carry lookahead logic for speed and compute a wide range of
arithmetic operations along with all sixteen logical functions
of two variables. Since then, the ALU has
become an essential building block for computer design.

       Figure 9 - Arithmetic logic unit.

An ALU block has two parallel data inputs for the left and
right operands and one parallel data output for the result.
The operation code input selects the arithmetic or logical
computation to be performed. External logic forces the
carry-in to a constant value or the carry-out of some
previous computation for multiple precision arithmetic.
In addition to carry-out, the ALU must produce condition
signals for a negative or zero result and arithmetic
overflow.

            ---
    A ---->|   |----> A < B
           |   |----> A = B
    B ---->|   |----> A > B
            ---

       Figure 10 - Comparator.

Often a programmer is only interested in comparing the
values of two quantities and then making a decision based
upon that comparison. This function is performed by
the "comparator" block (Figure 10.) Testing for zero is
even simpler. The logical NOR of all the bits in the word
will be true when all of the bits are zero.
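
Both functions are simple to sketch. The example assumes unsigned
operands and uses hypothetical function names. The zero test is
written as an explicit NOR of the individual bits to mirror the
hardware.

```python
def compare(a, b):
    """Return the comparator outputs (A < B, A = B, A > B), unsigned."""
    return (int(a < b), int(a == b), int(a > b))


def is_zero(word, width):
    """Logical NOR of all the bits in the word."""
    any_bit = 0
    for i in range(width):
        any_bit |= (word >> i) & 1   # OR of the bits
    return 1 - any_bit               # inverted to form the NOR
```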

It is sometimes necessary to rearrange the bits within
a binary word. Shift left and shift right are rearrangements
that move each bit one position to the left or right. Byte
swap instructions which exchange the position of the upper
and lower bytes within a word are another kind of rearrangement.
Finally, some implementations of the Fast Fourier Transform (FFT)
require a complex interchange of address bits for shuffle-exchange
operations. The FFT rearrangement is especially expensive (with
respect to runtime) to implement in software.

       Figure 11 - Shift and swap networks.

Luckily, hardware rearrangement is easy to accomplish -- two
components are interconnected in accord with the desired
rearrangement. Figure 11 illustrates wiring networks for left
shift, right shift and byte swap operations. In the case of
the shifts, "boundary" values must be supplied to set incoming
bits to the appropriate values. When implementing rotation,
for example, the outgoing bits are wrapped around to the incoming
positions forming the rotation.
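
A wiring network is nothing more than a fixed mapping from input
bit positions to output bit positions. The sketch below records
each network as a list in which entry i names the input bit that
is wired to output bit i, or None where a boundary value must be
supplied. The names and the list representation are assumptions
of the example.

```python
def rewire(word, wiring, boundary=0):
    """Route input bits to output bits according to a wiring list."""
    out = 0
    for i, src in enumerate(wiring):
        bit = boundary if src is None else (word >> src) & 1
        out |= bit << i
    return out


WIDTH = 16
# out[0] takes the boundary value, out[i] = in[i-1]
LEFT_SHIFT = [None] + list(range(WIDTH - 1))
# out[15] takes the boundary value, out[i] = in[i+1]
RIGHT_SHIFT = list(range(1, WIDTH)) + [None]
# the low byte of the output comes from the high byte of the input
BYTE_SWAP = list(range(8, 16)) + list(range(0, 8))
# as LEFT_SHIFT, but the outgoing bit is wrapped around to bit 0
ROTATE_LEFT = [WIDTH - 1] + list(range(WIDTH - 1))

swapped = rewire(0x1234, BYTE_SWAP)   # 0x3412
```

Since the mapping is fixed by the wires, the rearrangement costs
no gates at all, only interconnect.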

Section 3 - Programmable logic arrays.

Semiconductor manufacturers provide families of standard
building blocks that cover the most common design situations.
Thus, one expects to find registers, counters, adders or
ALU's in the family. However, an application may demand a
combinational function (or data operator) that is so unique that
it is not included in an industry standard family. The obvious
solution is to implement the combinational function using a
network of NAND and NOR gates. This is an expensive solution
if there are a large number (greater than 4) of inputs,
outputs and terms. A "random" gate implementation will require
several circuit packages and area for the interconnecting wires.

To satisfy these specialized needs, vendors began selling
programmable logic devices (PLD.) Internally, a PLD consists
of a regular array of gate (and sometimes storage) elements
in which the final interconnections have not yet been
determined. The desired logic function is obtained by physically
altering the device thereby making the final interconnect. This
is called "device programming" since the process is similar to
machine level programming. Two kinds of manufacturing and
programming schemes are in common use.

  1. All of the gates are formed on the integrated circuit chip
     with the exception of the final wiring step. The designer
     determines the configuration of the wires which are then
     added to the chip. "Mask programmable" devices of this type
     are suitable for high volume applications as the final
     manufacturing step is relatively expensive.
  2. Electrical connections are established between all of the
     inputs, outputs and gates. The array is programmed by
     selectively breaking any unwanted connections just like
     blowing a fuse. "Fusible link" devices can be programmed
     in the field (e.g., on a lab bench.)

Field programmability comes with a cost, however. Mask programmable
devices have a higher gate density than field programmable circuits
and can accommodate larger combinational functions. If the designer
is careful, the gate network can be partitioned so that more than
one distinct logic function can be implemented within a given PLD.
Thus, the higher density mask devices can support a larger number
of logic functions as well.

Programmable logic devices have several advantages over random
gate implementations. First, the entire logic function can be
implemented in a smaller number of circuit packages, perhaps in
just one pack. This saves board and interconnect area. Next,
less current is required with a corresponding decrease in power
(heat) dissipation. Driving signals off-chip across a circuit
board is a power hungry activity. The gates in a PLD are local
to one another and do not require the higher drive capability of
off-chip circuits. Third, manufacturers can produce programmable
devices in large quantities since a common circuit base can
satisfy many customers through programmability. Finally, a design
is easily changed by reprogramming the device.

       Figure 12 - Two level programmable logic array.

The most common kind of PLD is the programmable logic array (PLA)
which implements the sum of products form of a logic function. A
PLA consists of an AND-plane that contains the product terms and
an OR-plane that produces the sum of the product terms. The AND-plane
is driven by a set of buffering circuits that derive the true and
complement form of each input. Several inputs and outputs are
permitted. The exact number is determined by the specific part
which is selected for the design. The PLA in Figure 12 implements
a full adder. The three inputs are the left and right operand bits
and the carry-in. The two outputs are the sum and carry-out.
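
The two planes can be modeled directly. In the sketch below each
product term is a string over the inputs (A, B, Cin) where '1'
connects the true input, '0' connects the complement and '-'
leaves the input unconnected. The particular terms are the
standard sum of products form of a full adder and are an
assumption. The programming drawn in the figure may differ.

```python
AND_PLANE = ["001", "010", "100", "111",   # product terms for the sum
             "11-", "1-1", "-11"]          # product terms for the carry-out
# For each output, the product terms that are ORed together.
OR_PLANE = {"sum": [0, 1, 2, 3], "carry": [4, 5, 6]}


def pla(inputs, and_plane=AND_PLANE, or_plane=OR_PLANE):
    """Evaluate a two level PLA for a tuple of 0/1 input values."""
    products = [
        all(t == "-" or int(t) == x for t, x in zip(term, inputs))
        for term in and_plane
    ]
    return {name: int(any(products[i] for i in rows))
            for name, rows in or_plane.items()}


result = pla((1, 0, 1))   # A = 1, B = 0, Cin = 1
```

For these inputs the sum is 0 and the carry-out is 1.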

In addition to specialized combinational blocks, PLA's may be
used to construct finite state machine controllers and even
read only memory (ROM.)

Section 4 - Memory.

As the name implies, ROM is a storage element that is written
once and is only read thereafter. Those portions of the ISA
which are fixed are candidates for ROM implementation. In the
PDP-11 series machines, for example, interrupt entries,
the input/output device registers and even the condition
code register are addressable as memory locations. In the case
of the condition code register, the processor must recognize
the octal address 0177776 as the location of the condition code
register and either change or read the condition codes. The
constant 0177776 must be stored somewhere and a ROM is the
likely location. Control information such as the microprogram
emulating the ISA is usually stored in ROM. As noted in Chapter 2,
the availability of inexpensive high capacity ROM led to Complex
Instruction Set Computers.

             ---- Chip select
            |   --- Address
            |  |
            V  V
            ----
           |    |
           |    |----> Output
           |    |
            ----

     Figure 13 - Read only memory.

A read only memory has an address input and a data output. ROM's
may be concatenated horizontally to build wider words. If more
memory locations are needed, ROM's may be stacked vertically.
The ROM circuits share the same low order address lines. When
many ROM packages are connected in an array, high capacity
buffering circuits are required to drive all of the address inputs
in the array. A chip enable input is provided to assist the
stacking process. The high order address bits are first decoded.
Then the outputs of the decoder are connected to the chip enable
inputs of the ROM circuits that form a complete memory word.
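
Vertical stacking can be sketched as follows. Four hypothetical
4-word ROMs are combined into one 16-word memory. The ROM
contents, sizes and function names are invented for the example.
A disabled ROM is modeled as returning None, meaning that its
outputs are not driven.

```python
def make_rom(contents):
    """Build a ROM with a chip enable input and an address input."""
    def rom(chip_enable, address):
        return contents[address] if chip_enable else None
    return rom


# Four 4-word ROMs; the stored values are arbitrary test data.
roms = [make_rom([16 * chip + word for word in range(4)])
        for chip in range(4)]


def read(address):
    """Read one word from the stacked 16-word memory."""
    high, low = address >> 2, address & 0b11
    enables = [int(i == high) for i in range(4)]   # 2:4 decoder, high bits
    outputs = [rom(en, low) for rom, en in zip(roms, enables)]
    driven = [y for y in outputs if y is not None]
    assert len(driven) == 1      # exactly one ROM drives the output
    return driven[0]


word = read(5)   # chip 1, word 1
```

All four ROMs see the same low order address bits. Only the chip
enables differ.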

      Figure 14 - ROM implemented by PLA.

Read only memory can be implemented using a programmable logic
array (Figure 14.) The address is applied to the inputs of the
AND-plane where it is fully decoded. Since each product term
is unique, only one "row" in the OR-plane will be enabled. By
programming the data values into the OR-plane, the ROM data is
selected and sent to the output of the PLA. Please observe that
a PLA is more versatile than a ROM because the AND-plane permits
the use of "don't care" terms in addition to the true and
complement values of the inputs. Thus, a PLA can ignore a condition
or address input. For this reason, a programmable logic array
is better suited for the implementation of finite state machine
controllers that must respond to or ignore external stimuli.

                 ---- Chip select
                |   --- Address
                |  |
                V  V
                ----
               |    |
    Input ---->|    |----> Output
               |    |
                ----

     Figure 15 - Read/write memory.

Random access memory (RAM) is both readable and writable.
The term "random access" refers to our ability to address
any single location in memory at any time. In "sequential
access" memory, however, information is only accessible in
a serial stream of data. To obtain a word of information
in the middle or at the end, the memory is scanned from the
first word and through each succeeding word until the target
word is found. A magnetic tape drive is a sequential access
device. Serial access is generally much slower than random
access since all data items must be scanned in sequence
every time as opposed to the direct accessibility of RAM.

RAM circuits may be concatenated and stacked to obtain
wider words and bigger memories. As in the case of ROM,
the common address lines may need buffering and external
address decoding will be required for stacked arrays.

Most RAM today is electronic and will lose its data if
power is lost. The speed of electronic RAM approaches register
speed, and the high density afforded by RAM makes it the
best choice for primary memory. RAM may also be used to
construct cache memories and writable control stores. A
cache memory speeds up access to primary memory data by
maintaining frequently used values within the processor itself,
thereby avoiding relatively lengthy processor to memory
transactions. A writable control store (WCS), like a ROM control
store, holds the ISA microprogram. Since it is writable,
however, the microprogram may be changed and programming bugs
are easily fixed. The customer may also run a different
microprogram possibly emulating another ISA.

Section 5 - Selection.

Data selectors allow the machine to select one of
several different arguments or results. For example, if
two different data values must be stored in the same
register (at two different times, of course!), the values
must be routed to the register through a selector. The
value of the control input to the selector will choose one
of the two inputs at execution time.

                 ---
    Input0 ---->|   |
    Input1 ---->|   |----> Output
      ...       |   |
    InputN ---->|   |
                 ---
                  ^
                  |
               Select

      Figure 16 - Multiplexer.

               Gate
                |
              -----
               /
    Input ____/   ____ Output

       Figure 17 - Switch.

Three kinds of selection elements are available: multiplexers,
switches, and bi-directional data buffers. A multiplexer works
exactly like the selection element described above. It has N
data inputs, one selection input and one data output. A switch
element is a voltage (or current) controlled switch. nMOS
pass transistors and CMOS transmission gates are two examples
of a switch circuit. The path from the input to the output is
controlled by the gate input. When the gate input is high, the
path is established and the information flows through the switch.
Data may be routed by connecting switches into a series/parallel
network and sending selection information to the gates.
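
As an illustration, a multiplexer can be formed from a parallel
network of switches whose gates are driven by the decoded
selection value. The sketch is behavioral only. The function
names are assumptions, an open switch is modeled as None, and
signal losses through the switch are ignored.

```python
def switch(gate, value):
    """Pass the value when the gate is high; otherwise the path is open."""
    return value if gate else None


def mux_from_switches(inputs, select):
    """Select one input by closing exactly one switch."""
    gates = [int(i == select) for i in range(len(inputs))]  # decoded select
    paths = [switch(g, x) for g, x in zip(gates, inputs)]
    closed = [p for p in paths if p is not None]
    return closed[0]


chosen = mux_from_switches([10, 20, 30, 40], select=2)   # 30
```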

     Input0      Input1      Input2
       |           |           |
       @-Enable    @-Enable    @-Enable
       |           |           |
    ----------------------------------- Shared bus

       Figure 18 - Bi-directional data buffers.

Figure 18 shows another alternative for selection. In this
illustration, three inputs are connected to a shared bus
through bi-directional data buffers. An input value is placed
onto the bus by asserting the enable input of its bi-directional
driver. Since the shared bus cannot be simultaneously driven by
two or more senders, just one of the "Enable" inputs may be
asserted at any time. TTL tri-state buffers are an example
of a bi-directional data buffer. The output of a tri-state
buffer is turned on (driven) when deliberately enabled. When
the output is turned off, it places the driver into a high
impedance state which permits one of the other drivers on the
bus to send data.
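
The single sender rule can be expressed in a short sketch. Each
driver is modeled as an (enable, value) pair, a disabled driver
is in the high impedance state and contributes nothing, and two
enabled drivers are reported as an error. The function name and
the use of an exception are assumptions of the example.

```python
def shared_bus(drivers):
    """Return the bus value, or None if no driver is enabled."""
    active = [value for enable, value in drivers if enable]
    if len(active) > 1:
        raise RuntimeError("bus contention: more than one driver enabled")
    return active[0] if active else None


bus = shared_bus([(0, 0x11), (1, 0x22), (0, 0x33)])   # Input1 drives the bus
```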

The advantages and disadvantages of the three techniques
break down in the following way.

  Multiplexers are available in every circuit technology
  (logic family.) They are active circuits that consume power
  and force input wires to be routed to a central place.

  Switch elements consume very little or zero power. Due
  to signal losses through the switch, restoring logic
  (amplification) is required. They are not available in all
  circuit technologies.

  Bi-directional data buffers are excellent for distributed
  busses. Care must be taken, however, to guarantee a single
  sender at any time. Long busses (e.g., across a chip or
  backplane) must be driven hard for high speed with a
  corresponding increase in current consumption and heat dissipation.

Clearly, the choice of one scheme over another is an engineering
decision affected by the overall design context.

Section 6 - External interface.

External communication often demands special circuits and
provisions for compatibility. For example, terminals, modems
and computers intercommunicate using the RS-232C electrical
standard. RS-232C specifies the voltage levels, connector pins
and signal interpretations that permit any device adhering
to the standard to correctly exchange information. When an
interface becomes a standard, manufacturers will produce
components that directly support the standard, making the design
task easier. RS-232C receivers and drivers and asynchronous
line interfaces, for example, are readily available.

VLSI technology has special requirements for external communication.
Signals must eventually be driven off-chip to other subsystems
and of course, must be received as well. The microscopic features
of an integrated circuit chip must first be connected to the
macroscopic world of printed circuit boards and wires. This
function is performed by the package which surrounds and protects
the chip from mechanical damage. Thin gold wires are bonded to
the chip and the package pins forming the electrical bridge.

        Figure 19 - External interface.

To support off-chip connections, large metal regions called
"bonding pads" must be allocated on the chip (Figure 19.) The pads
are positioned around the edge of the chip permitting easy access
for wire bonding. The pads are physically large and consume
a substantial portion of total chip area. Further, the
capacitance of off-chip wires is much larger than that of
on-chip wires. Big drive transistors must be designed into
the interface to drive external connections (charge and discharge
the wires) without a performance penalty. As much as 40 percent
of total circuit power may be dissipated by the output drivers.
External interface circuits are not just big -- they may
be power hungry, too. Therefore, there is considerable pressure
to minimize off-chip connections and keep the pin out low.

Section 7 - Physical characteristics.

All physical circuits occupy space, consume electrical
current and generate heat. They take some time to switch
on and off and to communicate -- delay. The design
projects in this book use a set of circuit primitives
that are similar to "standard cell" integrated circuit
technology. Area (in blocks), supply current, power dissipation and
worst case delay are summarized for each building block
in the table below.

Standard cells          Blocks  Supply   Power   Delay
----------------------  ------  -------  ------  -------
8 x 8 register file     4 x 5    7   mA  35  mW  30 nsec
16 x 16 register file   8 x 10   25  mA  125 mW  35 nsec
8-bit register          4 x 1    0.8 mA  3.7 mW  20 nsec
16-bit register         8 x 1    1.5 mA  7.5 mW  20 nsec
8-bit counter           5 x 1    1.0 mA  5   mW  20 nsec
16-bit counter          10 x 1   2.0 mA  10  mW  25 nsec
Control store           12 x 6   50  mA  250 mW  35 nsec
ALU                     8 x 4    20  mA  100 mW  30 nsec
16-bit 2:1 mux          4 x 1    0.8 mA  4   mW  10 nsec
16-bit 4:1 mux          4 x 1    0.4 mA  2   mW  10 nsec
8-bit external input    8 x 2    0   mA  0   mW  0  nsec
16-bit external input   16 x 2   0   mA  0   mW  0  nsec
8-bit external output   8 x 2    20  mA  100 mW  100 nsec
16-bit external output  16 x 2   40  mA  200 mW  100 nsec
8-bit ext. tristate     8 x 2    25  mA  110 mW  100 nsec
16-bit ext. tristate    16 x 2   45  mA  225 mW  100 nsec
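
A first estimate for a candidate organization can be made by
adding up the table entries for the blocks it uses. The sketch
below does this for a hypothetical parts list. The parts list is
invented for the example, the "Blocks" column is read as width x
height (an assumption), and the area total ignores the space
needed for interconnect.

```python
# name: ((width, height) in blocks, supply mA, power mW, delay nsec)
CELLS = {
    "16 x 16 register file": ((8, 10), 25.0, 125.0, 35),
    "16-bit register":       ((8, 1),  1.5,  7.5,   20),
    "16-bit counter":        ((10, 1), 2.0,  10.0,  25),
    "ALU":                   ((8, 4),  20.0, 100.0, 30),
}


def budget(parts):
    """parts maps a cell name to the number of instances used."""
    area = sum(CELLS[n][0][0] * CELLS[n][0][1] * k for n, k in parts.items())
    supply = sum(CELLS[n][1] * k for n, k in parts.items())
    power = sum(CELLS[n][2] * k for n, k in parts.items())
    return area, supply, power


# Hypothetical parts list, not the SP.1 design.
example = {"16 x 16 register file": 1, "16-bit register": 2,
           "16-bit counter": 1, "ALU": 1}
area, supply, power = budget(example)
```

For this parts list the totals are 138 blocks, 50 mA and 250 mW.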

Copyright (c) 1987-2013 Paul J. Drongowski