Computer Design
A computer aided design and VLSI approach
Paul J. Drongowski

Chapter 11 - Speed, space and power estimates.

Ultimately, the system must provide adequate performance and satisfy any engineering requirements (unit cost, size, packaging, power supply, cooling). Procedures for estimating speed, space and power characteristics are given in this chapter. We will use a fictitious set of building blocks which resemble VLSI standard cells to illustrate the estimation process.

Section 1 - Introducing standard cells.

Standard cells are an important technological choice for the implementation of a computing system. (Chapter 16 will look at standard cells in more detail.) A standard cell is a VLSI building block which implements a common function such as an adder, register, register file, etc. Design begins in the usual way -- a block diagram is drawn for the data and control paths along with the control flow graph. To obtain a completed design that is ready for fabrication, the standard cells must be placed on the surface of the chip ("layout") and interconnected using silicon (or metal) wires. Any signal coming into or leaving the system must pass through special input or output circuits which condition the signal for the external environment and provide bonding pads. The bonding pads will be used to connect the chip with the DIP leads.

Because circuits are physical devices, they consume space and power, dissipate heat and have some minimum switching time. The tricky part of VLSI design is choosing a dense arrangement of circuits within the confines of a fixed size chip. Input, output and tristate circuits must be placed along the borders of the chip to make wire bonding easier. Power consumption and dissipation must be within acceptable limits. Switching speed and propagation delays depend heavily upon wire lengths. Shorter wires have lower capacitance and shorter propagation delays.
Critical path components should be placed near each other, thereby reducing the length of the wires (and delays!) along the critical timing path. Off-chip interconnections are the worst -- capacitances are 50 to 100 times higher than those of on-chip wires. Delay is proportionately higher or, if we attempt to keep delay constant, power consumption and heat dissipation are.

Standard cell             Size (blocks)   Supply current   Power     Delay
------------------------  -------------   --------------   -------   --------
8 x 8 register file       4 x 5           7 mA             35 mW     30 nsec
16 x 16 register file     8 x 10          25 mA            125 mW    35 nsec
8-bit register            4 x 1           0.8 mA           3.7 mW    20 nsec
16-bit register           8 x 1           1.5 mA           7.5 mW    20 nsec
8-bit counter             5 x 1           1.0 mA           5 mW      20 nsec
16-bit counter            10 x 1          2.0 mA           10 mW     25 nsec
Control store             12 x 6          50 mA            250 mW    35 nsec
ALU                       8 x 4           20 mA            100 mW    30 nsec
16-bit 2:1 mux            4 x 1           0.8 mA           4 mW      10 nsec
16-bit 4:1 mux            4 x 1           0.4 mA           2 mW      10 nsec
8-bit external input      8 x 2           0 mA             0 mW      0 nsec
16-bit external input     16 x 2          0 mA             0 mW      0 nsec
8-bit external output     8 x 2           20 mA            100 mW    100 nsec
16-bit external output    16 x 2          40 mA            200 mW    100 nsec
8-bit ext. tristate       8 x 2           25 mA            110 mW    100 nsec
16-bit ext. tristate      16 x 2          45 mA            225 mW    100 nsec

Table 1 - Fictitious standard cell set.

Section 2 - Clock period.

In order to determine overall system performance, we must first compute the expected clock period. This time will be determined by the longest delay path. The longest delay path is found by tracing the flow of information through each of the paths leading from the source (operand) storage elements through any data operators or selection blocks to the destination storage elements. The delay for each path is the sum of:

* The time required to read an operand value (if any),
* The switching delays incurred by the data operators and selectors, and
* The set-up and hold times of the destination storage elements.
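
The path tracing just described can be sketched in a few lines of Python. This is only an illustration, not part of the design procedure itself. It uses the block delays that appear in Figure 1 of this section, and it assumes that the 50 ns block feeds both Destination-0 and the 20 ns block. The block names A, B and C, the FANOUT table and the function names are hypothetical, and the operand read time is assumed to be zero because the figure does not give one.

```python
# Block delays in ns, taken from Figure 1. The names A, B, C are hypothetical.
BLOCK_DELAY = {"A": 50, "B": 25, "C": 20}
SETUP_HOLD = 20  # combined set-up and hold time of a destination register (ns)
READ_TIME = 0    # assumption: Figure 1 gives no operand read time

# Assumed wiring of Figure 1: each source or block and what it feeds.
FANOUT = {
    "Source-0": ["A"],
    "Source-1": ["B"],
    "Source-2": ["B"],
    "A": ["Destination-0", "C"],
    "B": ["C"],
    "C": ["Destination-1"],
}


def path_delays(node, elapsed=0, trail=()):
    """Yield (delay, path) for every route from node to a destination."""
    trail = trail + (node,)
    if node.startswith("Destination"):
        # A path ends by paying the destination's set-up and hold time.
        yield elapsed + SETUP_HOLD, trail
        return
    for nxt in FANOUT[node]:
        # Destinations are not in BLOCK_DELAY, so they add no switching delay.
        yield from path_delays(nxt, elapsed + BLOCK_DELAY.get(nxt, 0), trail)


def clock_period():
    """Return (delay, path) of the longest source-to-destination path."""
    sources = [n for n in FANOUT if n.startswith("Source")]
    return max(p for s in sources for p in path_delays(s, READ_TIME))


period, path = clock_period()
print(period, "ns via", " -> ".join(path))
```

Every source-to-destination route is enumerated and the largest simple sum is taken as the minimum clock period. Exhaustive enumeration is adequate for a block diagram of this size. A larger design would probably call for a topological-order longest-path computation instead.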

As we will see in a later chapter, the delay times are perturbed by the length of the wires interconnecting the blocks -- a significant source of delay if the wires are long.

The longest path in Figure 1 starts at block Source-0 and ends at Destination-1. Assuming a 20 ns combined set-up and hold time (from Table 1), the path delay and minimum clock period is 50 + 20 + 20 = 90 ns.

              -------
             |       |
Source ----->| 50 ns |-----------------> Destination
  0          |       |       |                0
              -------        |
Source --                    V
  1     |   -------       -------
        -->|       |     |       |
           | 25 ns |---->| 20 ns |----> Destination
        -->|       |     |       |           1
        |   -------       -------
Source --
  2

Figure 1 - Longest delay path.

Programmable and multiphase clocks complicate the analysis a bit. Instead of estimating a single clock period, the designer must determine minimum times of the individual clock phases or periods. The same procedure can be used, but the assignment of data or storage operations to phases (periods) must be taken into account.

This procedure sums the individual delay times of the components. Logic circuits are not strictly on-off devices, but possess an analog character as well. Thus, successive stages of logic will begin to switch even though the earlier stages have not achieved their final minimum or maximum voltage values. A simple sum will therefore yield a conservative (pessimistic) estimate. For MOS combinational logic, the square root of the sum of squares provides a better approximation to actual delay. For the example in Figure 1, the combinational delay by this method is sqrt(50^2 + 20^2), or about 54 ns, and the expected clock period is 54 + 20 = 74 ns.

                     * * * *
                  *           *
                *               *
               *                 *
              *                   *
             *                     *
           *                         *
           -----|---------------|-----
               Min   Typical   Max
         Reject |  Acceptable   | Regrade

Figure 2 - Distribution of delays.

The integrated circuit fabrication process is not perfect and real component delay times will follow a statistical distribution over time (Figure 2). The published specifications for a part will show three delay values: minimum, typical and maximum.
Delivered parts will operate within the stated minimum and maximum delay values and the average of the distribution will be the typical (or "nominal") delay. Devices that operate faster or slower can be regraded and sold at either a premium or reduced price (respectively). If the distribution is symmetric about the mean, then one would expect half of the parts to run slower and the other half to run faster than the typical delay value.

The actual speed of an assembled product depends upon the characteristics of its components. Two alternatives are possible.

* The designer is optimistic and assumes that fast parts will offset the increased delay of the slower parts. Optimists use typical delays when computing the clock speed.
* The designer is pessimistic and performs a worst case computation using stated maximum delays.

In the second case, the clock period will always be sufficiently long to guarantee reliable operation of production units. However, worst case analysis may force the designer to set artificially high and costly timing goals for the design.

The optimistic designer, however, has made one bad assumption -- that all components have equivalent delay times which will compensate for one another. This assumption does not always hold. For example, the increased delay of a relatively slow ALU is not necessarily offset by a faster NAND gate. Increased computation time will reduce the time left for storage set-up below the necessary minimum, and correct, reliable operation will cease. A portion of the production units will fail for this reason. The percentage of failed units can be reduced by adding a safety margin (fudge factor) to the clock period calculation.

VLSI designers have a slightly different problem. Since all building blocks reside on the same chip and are manufactured under the same processing conditions, the speed of all blocks will increase or decrease together. (We are assuming uniform processing across the die.
This assumption definitely does not hold for wafer scale systems.) The designer must choose a clock period that guarantees an economically viable yield -- the percentage of good parts after fabrication. If the target clock speed is set too high, then too few of the fabricated devices will meet the minimum acceptable performance. Low yield usually means low profit as the expense of fabrication must be spread across the price of the working chips.

Section 3 - Control graph analysis.

Once the clock speed has been determined, the control graph can be analyzed. The following procedure is suggested.

* Label each control event symbol with its expected execution time. This time will vary from event to event in systems employing a programmable clock.
* For each instruction or system operation, trace the execution flow of the control graph beginning with instruction fetch and ending with the last event needed to interpret the instruction.

Conditional branches and loops pose an interesting analytical problem. One solution is to trace each execution path and build a table of conditions and execution times. In the case of the SP.1 branch on zero instruction, the table below:

Condition |  Time
-------------------------
  Z = 0   | 100 ns
  Z = 1   | 200 ns

gives the execution time of the conditional branch instruction for each value of the Z flag. Loop times may be parameterized in the number of iterations around the loop.

Section 4 - Space.

Naturally, physical circuits occupy space. A variety of packaging and interconnection techniques are available and usually two or more different methods will be used in the mechanical design of the product. (Packaging and interconnection will be discussed in detail in Chapter 17.)

* At the printed circuit board level, packages (dual in-line, pin grid array, leadless chip carrier, etc.) are interconnected by metal wires routed on one or more board layers.
  Board area is finite (obviously) and is often determined by the standard form and mechanical design employed by a vendor.
* VLSI chips consist of building blocks that are interconnected by semiconductor and metal wires. Chip area is limited by the defect density of the fabrication process and the desire for acceptable production yields.

Thus, the bottom-up space analysis to be performed must determine if the data and control circuits can fit onto a board or chip of the desired size. If the design is too big, it must be broken across chip or board boundaries.

To estimate size, the designer drafts a scale drawing of the physical layout of the system. The "floor plan" shows the relative sizes and placement of the building blocks on each chip or board. Design experience and maximum power dissipation levels help to determine the maximum number of blocks that can fit onto the floor plan. On-chip wiring is a significant concern in VLSI design as up to 80 percent of chip area may be required for interconnect.

System designers always try to use the most inexpensive cooling method possible. The fundamental physical limitations on heat transfer may dictate the use of fewer components or a lower power circuit technology.

Percentage of active and wire area and the number of circuits or pins can be used as rough size (and cost) measures.

Section 5 - Power.

Current consumption and power dissipation values are calculated by summing the individual current and power values of the components in the system. Total heat dissipation will indicate if system packaging and cooling are sufficient. If the expected power dissipation is too high, then either a more expensive form of cooling or packaging will be required or a lower power design must be found.

The computed current consumption indicates the size (and cost) of the power supply for the product.
This value is also used to size the power distribution grid to assure that power bus wires will have sufficient cross-sectional area to carry power with a safe current density.

Actual system current consumption and power dissipation depend upon the operating characteristics of the circuit technology. In nMOS and bipolar technologies, designers assume that half of the gates will be switched high and the other half will be switched low. Thus, actual current and power will be 50 percent of their worst case values. (The percentage by which a value is derated is sometimes called the "duty cycle.") Power dissipation in CMOS circuits depends upon the clock (switching) frequency since more current flows through a CMOS gate while it is switching than while it is quiescent.

Area, current and power dissipation calculations take the form of a spreadsheet (Figure 3). The percentage of active area is the area occupied by the blocks divided by the total area of the system (chip).

Quantity | Component       | Area | Current | Dissipation
---------------------------------------------------------
    1    | Register file   |  80  |  25 mA  |   125 mW
    1    | ALU             |  32  |  20 mA  |   100 mW
    4    | 2:1 multiplexer |  16  |   3 mA  |    15 mW
---------------------------------------------------------
                             128     48 mA      240 mW

Figure 3 - Area, current and power spreadsheet.

Section 6 - Design problem ???.

For this problem, you must design the data and control paths for the SP.1 computer using the set of standard cells given in Table 1. Your design documentation should include the following items.

* A block diagram for the system showing the cells and their interconnections.
* A control graph which will become the microcode (controller programming information) for the system.
* A floor plan which shows the physical placement of the cells on one or more chips.
* Estimates for clock period (microinstruction cycle time), power consumption, power dissipation, percentage of active (cell) area, and percentage of wire area.
* An essay describing your design considerations and trade-offs.

The floor plan must be drawn on graph paper. Total chip area is 18 by 18 blocks and, at a minimum, forty percent of chip area should be allocated for wiring. The maximum amount of power which the system may dissipate is two watts.

A few notes on Table 1 are in order. Cell sizes are given in "blocks." When laying out the system, you should assume that data flows into one side of a cell and out the opposite side and that parallel data will arrive and leave a cell along its broadest sides. Current consumption and power dissipation values are maximums and may be scaled by the expected duty cycle. Delay times are also worst case. The delay time for a storage component is the combined set-up and hold time of the standard cell.

Copyright (c) 1987-2013 Paul J. Drongowski