ENGR857 HW2
Homework # 2 Due: Oct 31st
Building FPGAs (Copyright: materials borrowed from Russell Tessier at the University of Massachusetts)
1 Introduction
FPGA design is about tradeoffs. While the basic architecture of most devices is similar, each vendor device typically has distinguishing features that set it apart from the others. In creating new FPGA architectures, designers must explore implementation tradeoffs that lead to effective use of VLSI area. In this exercise, you will examine several of these issues.
This assignment focuses on the island-style FPGA architecture we have been discussing in class. To complete this assignment you will perform a number of experiments using two benchmark circuits commonly used for FPGA research. Several academic CAD tools will be applied to these circuits to help you learn more about the process of translating a text easily be modified. Our goal for this exercise is not so much to change the functionality of the tools (this will be explored in HW3) but rather to examine the effect of architectural issues such as logic block size and wiring network connectivity on the overall size of a target device.
2 Island-style FPGAs
First, please read [2] and [1]. A brief summary of the FPGA model presented in these sources is given below.
The model for FPGA architecture used in this assignment (shown in Figure 1) models the architecture of several commercial FPGAs from Xilinx and Altera at a high level. The three major architectural parameters in the model are the amount of logic present in each island location labelled L, the connectivity between the logic elements and the switching network at locations labelled C, and the connectivity between wires in the switching network at locations labelled S. In this assignment you will have the opportunity to explore each of these parameters and understand functional tradeoffs.

Figure 1: Island-style FPGA model
Each logic cluster L contains collections of one or more look-up tables and flip flops as described in detail in Section 2 of [2] and shown in Figure 2. The output from each LUT-FF pair or basic logic element (BLE) can be either registered or unregistered as needed. Two architectural parameters control the functionality of the logic cluster, N and I. N is the number of BLEs per cluster and I is the number of cluster inputs. For this assignment, each cluster input can be fanned out to any BLE input through an input multiplexer. Note that all BLE outputs loop back around the BLEs and can be used as BLE inputs.
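As a rough illustration of how N and I shape the wiring inside a cluster, the sketch below counts input multiplexers. It assumes that every BLE input has its own multiplexer and that each multiplexer can select any of the I cluster inputs or any of the N fed-back BLE outputs. The function name and the default LUT size of 4 are illustrative choices, not part of the CAD tools.

```python
def cluster_input_muxes(n_bles, n_inputs, lut_size=4):
    """Return (number of BLE input muxes, selectable sources per mux).

    Assumption: one mux per BLE input, each able to pick any cluster
    input or any BLE output that loops back into the cluster.
    """
    mux_count = n_bles * lut_size      # one mux for every BLE input pin
    mux_width = n_inputs + n_bles      # I cluster inputs + N feedback paths
    return mux_count, mux_width


count, width = cluster_input_muxes(n_bles=4, n_inputs=10)
print(count, width)  # 16 muxes, each choosing among 14 sources
```

Under these assumptions, both the number of multiplexers and their width grow with N, which is one reason cluster area does not scale linearly with cluster size.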
In Figure 1, routing channels of width W (in this case 3) are connected to logic clusters through a set of programmable switches, referred to as connection or C blocks, at the intersection of logic cluster IO terminals and channel tracks. The flexibility of the C block or Fc is represented by the fraction of channel tracks that can possibly connect to a logic element. In the example shown in Figure 1b only two of the tracks could possibly attach to either an input or output so Fc = 0.66. Note that in some literature (e.g. [5]) Fc is defined as the absolute number of tracks connected to the cluster I/O, so Fc = 2 for Figure 1b using this notation.
Wire segments in the routing channels span one logic cluster in the horizontal or vertical dimension. Switchboxes, or S blocks, allow a predefined set of programmable connections between wires at the intersection of horizontal and vertical track channels. Figure 1a shows that each switchbox is sparsely connected so that each horizontal or vertical wire entering the switchbox can connect to only three possible destinations (Fs = 3).
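The two Fc conventions mentioned above can be related with a short sketch. The helper names are hypothetical. The fraction is kept exact with Python's `fractions` module, so the Figure 1b case comes out as 2/3, which the text writes as 0.66.

```python
from fractions import Fraction


def fc_fraction(connected_tracks, channel_width):
    """Fc as the fraction of channel tracks a cluster pin can reach."""
    return Fraction(connected_tracks, channel_width)


def fc_absolute(fc_frac, channel_width):
    """Fc as an absolute track count, the notation used in [5]."""
    return int(fc_frac * channel_width)


fc = fc_fraction(connected_tracks=2, channel_width=3)
print(fc)                  # 2/3
print(fc_absolute(fc, 3))  # 2
```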

Figure 2: Logic Cluster
3 Getting Started
Completion of the assignment will require the use of three academic CAD tools written by Vaughn Betz and Sandy Marquardt at the University of Toronto. These tools are applied to several benchmark logic design circuits from the MCNC FPGA benchmark suite. The source for these tools is well commented and you are encouraged to look over parts of the code to better understand how results are generated. The benchmark circuits you need are located in file hw.tar.gz. VPR version 4.30 can be obtained from the VPR web site: download the source for version 4.30, and note that the downloaded file (vpr_430_tar) also includes t-vpack. Please follow the directions on the course web site to obtain the necessary tools and configure them on your computer. The standard distribution of VPR works on Sun Solaris. If you use the makefile, VPR should compile under Linux on server wildfire. VPR has also been successfully compiled for other architectures (HP Unix, MS Windows). You are free to use any type of computer for this assignment.
Two benchmark circuits, apex2 and pdc, and associated assignment makefiles may be found in subdirectory tests in the distribution. These should be used for all exercises. Initially, circuits are represented as a collection of lookup tables and flip flops described in blif, a common language for describing circuits. These circuits will be translated into logic clusters using T-VPack, a logic clustering tool, and placed and routed using VPR, an academic placement and routing tool. Following layout, trans_count is used to evaluate the total number of routing and logic transistors needed to successfully implement the circuit in a minimum-sized island-style FPGA. It is suggested that you look over the VPR and T-VPack User's Manual, manual430.pdf, prior to starting the following experiments. The T-VPack description will help you understand how T-VPack operates and the role of various parameters. You can learn more about trans_count parameters by going to the trans_count directory and typing trans with no arguments.
4 Varying Logic Cluster Size
For the first set of exercises you will determine the relationship of I, inputs per logic cluster, to N, BLEs per logic cluster, for several cluster-based island-style devices. We shall determine the appropriate number of inputs per cluster by repeatedly applying T-VPack with different numbers of inputs specified per cluster. A similar set of experiments is described in Section 5 of [2] and Section V.A of [1].
Ex 1: Logic Clustering Algorithm
Summarize the clustering algorithms described in Section 4 of [2] and in Section 6.2 of [5]. How are the approaches similar? How are they different? Please limit your discussion to somewhere between half and a whole page.
According to the given clustering algorithm, the number of inputs required by BLEs derived from the logic design and assigned to a logic cluster must not exceed the number of inputs available, I, even if some BLEs in the cluster must be left unpopulated. Clearly, minimizing I in an architecture is beneficial since fewer inputs per cluster means fewer input switches and potentially more device area for a larger logic array. However, if I is too small, many BLEs will be left unused. The goal is to find the minimum I value per cluster that still allows high total BLE utilization in the device. The number of potential values of I for a cluster ranges between 4, the number of inputs for one BLE, and N × 4, the total number of all BLE inputs in the cluster.
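The input constraint can be made concrete with a small sketch. It is not T-VPack's code: the representation of a BLE as an (output net, input nets) pair and all function names are assumptions made for illustration, and clock inputs are ignored. Nets produced inside the cluster are not counted, since BLE outputs loop back without using a cluster input.

```python
def external_inputs(bles):
    """Nets a group of BLEs must bring in from outside the cluster.

    bles: list of (output_net, input_nets) tuples (hypothetical format).
    """
    produced = {out for out, _ in bles}   # nets generated inside the cluster
    used = set()
    for _, ins in bles:
        used.update(ins)
    return used - produced                # feedback nets need no cluster input


def fits_in_cluster(bles, n_bles, n_inputs):
    """True if the BLEs respect both the N and the I limits."""
    return len(bles) <= n_bles and len(external_inputs(bles)) <= n_inputs


def candidate_input_counts(n_bles, lut_size=4):
    """All values of I worth considering: 4 up to N x 4."""
    return range(lut_size, n_bles * lut_size + 1)


group = [("x", ["a", "b", "c", "d"]), ("y", ["x", "a", "e", "f"])]
print(sorted(external_inputs(group)))               # ['a', 'b', 'c', 'd', 'e', 'f']
print(fits_in_cluster(group, n_bles=4, n_inputs=10))  # True
print(fits_in_cluster(group, n_bles=4, n_inputs=5))   # False
```

In this example the two BLEs share input a and pass net x internally, so they need six cluster inputs rather than eight. Input sharing of this kind is why useful values of I are usually well below N × 4.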
Ex 2: Comparing N to I
For designs apex2 and pdc repeatedly apply T-VPack to evaluate tradeoffs between I, N, and the number of logic clusters needed to implement a design. This analysis can be accomplished with the following example steps:
• Examine the makefile in directory tests/apex2. Note the variables CLUSTER_SIZE and INPUTS_PER_CLUSTER assign N = 4 and I = 10 for this example. Before making changes to the makefile, make a copy of it for safekeeping. Change the variable CPATH to point to your local copy of the CAD tools.
• Read over the description of T-VPack in manual430.pdf, paying particular attention to the parameters the tool takes as input. Compare these to the parameter settings located in the makefile. Run T-VPack by typing make apex2.net. The result of T-VPack is a netlist which can be used for subsequent circuit placement and routing. Repeat this experiment for N = 4, I = 4, 8, 9, 10, 11, 12, 13, 14, 16 and N = 8, I = 4, 8, 12, 16, 18, 19, 20, 21, 22, 24, 28, 32 for both apex2 and pdc. To simplify this task you may wish to modify and use the shell script foreach.script. Plot results so that the fraction of total available BLEs used per design appears along the vertical axis and I4N appears along the horizontal axis. What can you conclude about I from these experiments? What minimum values of I achieve 98% BLE utilization for each design and value of N?
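The bookkeeping for this sweep can be sketched as follows. The code only enumerates the requested (design, N, I) combinations and post-processes counts you have already collected. It does not run T-VPack, and the helper names are hypothetical. Utilization is taken to be the number of BLEs in the design divided by the number of BLE slots in the clusters used.

```python
# (N -> list of I values) exactly as listed in the exercise.
SWEEP = {
    4: [4, 8, 9, 10, 11, 12, 13, 14, 16],
    8: [4, 8, 12, 16, 18, 19, 20, 21, 22, 24, 28, 32],
}
DESIGNS = ["apex2", "pdc"]


def sweep_points():
    """Every (design, N, I) combination the exercise asks for."""
    return [(d, n, i) for d in DESIGNS for n, vals in SWEEP.items() for i in vals]


def ble_utilization(num_bles, num_clusters, n_bles):
    """Fraction of available BLE slots that hold a BLE."""
    return num_bles / (num_clusters * n_bles)


def min_inputs_for_target(results, target=0.98):
    """Smallest I whose utilization meets the target, or None.

    results: dict mapping I -> utilization for one design and one N.
    """
    ok = [i for i, u in results.items() if u >= target]
    return min(ok) if ok else None


print(len(sweep_points()))  # 42 T-VPack runs in total

# Placeholder numbers for illustration only, not measured results.
made_up = {8: 0.90, 10: 0.985, 12: 1.0}
print(min_inputs_for_target(made_up))  # 10
```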
5 Evaluating VLSI Area Costs
An important aspect of FPGA design is understanding how much VLSI area is needed to implement a specific circuit. In the previous section we evaluated tradeoffs in logic block size. Here, we include routing area in the evaluation.
To a first order, the area of an FPGA device can be estimated by determining the number of minimum-width transistors needed to construct the device. While a full device layout would be needed to get the most accurate assessment, in this exercise we compare FPGA array sizes by counting the number of transistors needed to implement both interconnect and logic for various device sizes. In general, interconnect transistors include those found in C-blocks and S-blocks and logic transistors include those found in logic clusters. Fringing effects due to IO pad connections are not considered in this analysis. In the following exercises you will take netlists created by T-VPack and apply VPR to first place the circuit in a device with just enough logic clusters to hold it and then route interconnections using the minimum number of tracks per channel needed to route it, Wmin. Although the placement and routing processes are controlled by makefile settings that have been preset for the following experiments, you are encouraged to experiment with them to become familiar with the tools.
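One way to picture the search for Wmin is the sketch below. It is not VPR's implementation. It assumes a caller-supplied predicate `routable(w)` (a hypothetical stand-in for one routing attempt at channel width w) and assumes that any width above a routable width is also routable. The search doubles the width until routing succeeds, then bisects between the last failure and the first success.

```python
def find_min_width(routable, start=8):
    """Return the smallest channel width for which routable(w) is True.

    Assumes routable is monotonic: if width w routes, so does w + 1.
    """
    lo = 0        # largest width known to fail (0 tracks never routes)
    hi = start
    while not routable(hi):   # grow until a routable width is found
        lo = hi
        hi *= 2
    while hi - lo > 1:        # bisect between failure (lo) and success (hi)
        mid = (lo + hi) // 2
        if routable(mid):
            hi = mid
        else:
            lo = mid
    return hi


# Toy predicate standing in for a real routing attempt.
print(find_min_width(lambda w: w >= 11))  # 11
```

Each call to the predicate corresponds to a full routing attempt, so a search that needs few attempts matters when a single attempt takes a long time.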
Following placement and routing, trans_count is used to determine the number of logic and routing transistors needed to implement each circuit. This program takes as input the minimum number of tracks per channel found from routing (Wmin), the number of blocks per cluster (N), number of inputs per cluster (I), Fc, and Fs. All of these parameters are predetermined except for Wmin, which is found after routing by VPR.
If you wish, you can use routing area numbers reported from VPR rather than trans_count for the following experiments. These numbers are somewhat more accurate than the numbers reported by trans_count as they take more routing factors into account. Note that you will still need to determine logic cluster transistor counts using trans_count if you choose to determine routing area using the VPR reported values.
Perhaps the best way to learn about how these tools work is through an example. Consider the following steps:
• For design apex2, use T-VPack to create a .net file with N = 4 and I = 10. Note that the ARCH_FILE variable in the makefile is 4x4lut_sanitized.arch to allow for VPR compilation with N = 4 and Fc in the makefile is 0.5, the connection block flexibility for N = 4. Note that these two values should be set to 4lut_sanitized.arch and Fc = 1, respectively, for later experiments with N = 1.
• Create a placement for the design by typing make apex2.p. Pay special attention to the size of the array that is targeted. VPR determined that this was the minimum square array size that would hold this circuit.
• Route the new placement by typing make apex2.r. Note that the router will try routing at a number of track widths until a minimum value, Wmin, is found that will successfully complete routing. Note: routing for each design will require several minutes.
• In the makefile, change the variable WIDTH to Wmin found in the previous step. Now type make apex2.area to get the per logic cluster area of the device. This should be scaled by the array size to determine the total number of transistors needed to implement the design.
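The scaling in the last step can be sketched as below. The helper names are hypothetical, the per-cluster figure is assumed to cover one logic cluster together with its share of routing, and the numbers in the usage lines are placeholders rather than tool output.

```python
def total_transistors(per_cluster, array_width, array_height=None):
    """Scale a per-cluster transistor count up to the whole array.

    If only one dimension is given, the array is assumed to be square.
    """
    if array_height is None:
        array_height = array_width
    return per_cluster * array_width * array_height


def routing_share(logic_transistors, routing_transistors):
    """Fraction of all transistors that sit in the interconnect."""
    return routing_transistors / (logic_transistors + routing_transistors)


# Placeholder values for illustration only.
print(total_transistors(per_cluster=1000, array_width=20))  # 400000
print(routing_share(logic_transistors=250, routing_transistors=750))  # 0.75
```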
Ex 3: Evaluating Area Costs
Repeat the above experiment for design pdc, N = 4, I = 10 and for both apex2 and pdc, N = 1, I = 4. Be sure to check Fc and WIDTH each time you run trans_count so that you obtain accurate results. As noted in the makefile, Fc should be set to 1 for N = 1 and 0.5 for N = 4. Summarize the results of your four experiments in a table with a brief description of what you found. The table should include the number of transistors needed to implement the whole device in addition to the number of transistors per logic cluster. Where are most of the transistors located in the FPGA, in the logic or in the interconnect? Explain why Fc should be lower for larger cluster sizes. Consider using a script (like foreach.script) to speed up your experiments. Optional: Try other cluster sizes such as N = 8. Examine file trans_count/trans_logic.c to determine exactly how transistors that make up logic blocks are allocated.
References
[1] E. Ahmed and J. Rose. The Effect of LUT and Cluster Size on Deep-Submicron FPGA Performance and Density. IEEE Transactions on VLSI, Mar. 2004.
[2] V. Betz and J. Rose. Cluster-based Logic Block for FPGAs: Area-Efficiency vs. Input Sharing and Size. In Proceedings, Custom Integrated Circuits Conference, 1997.
[3] G. Lemieux, E. Lee, M. Tom, and A. Yu. Directional and Single-driver Wires in FPGA Interconnect. In IEEE International Conference on Field Programmable Technology, Brisbane, Australia, Dec. 2004.
[4] D. Lewis. The Stratix II Logic and Routing Architecture. In International Symposium on Field Programmable Gate Arrays, Monterey, Ca., Feb. 2005.
[5] A. Marquardt, V. Betz, and J. Rose. Using Cluster-Based Logic Blocks and Timing-driven Packing to Improve FPGA Speed and Density. In International Symposium on Field Programmable Gate Arrays, Monterey, Ca., Feb. 1999.