
ADSL, VDSL, and Multicarrier Modulation. John A. C. Bingham. Copyright © 2000 John Wiley & Sons, Inc. Print ISBN 0-471-29099-8; Electronic ISBN 0-471-20072-7

APPENDIX C

EFFICIENT HARDWARE IMPLEMENTATIONS OF FFT ENGINES

Mitra Nasserbakht, Intel Corporation. E-mail: Mitra.Nasserbakht@intel.com or Bite 5000@aol.com
C.1 OVERVIEW

Efficient algorithms for computing the discrete Fourier transform (DFT) have enabled widespread access to Fourier analysis in numerous fields. These application areas span disciplines ranging from applied mechanics and structural modeling to biomedical engineering. In the signal-processing arena, Fourier theory has been widely used for signal recognition, estimation, and spectral analysis. Fourier analysis has been at the core of many communication system subblocks, such as those used for echo cancellation, filtering, coding, and compression. The ability to compute the DFT in real time and with minimal hardware is the key to the successful implementation of many of these complex systems.

The fast Fourier transform (FFT) is an efficient algorithm for computing the DFT of time-domain signals. The focus of this chapter is on the necessary ingredients for the design of FFT processing engines capable of handling data of a real-time nature found in most digital signal processing and telecommunications applications.

This appendix starts with a brief overview of the FFT and its computation. Top-level system requirements, addressing, arithmetic processing, the memory subsystem, and data ordering are discussed in Section C.3. Section C.4 is devoted to a discussion of implementation issues for a representative FFT processing engine.

C.2 FAST FOURIER TRANSFORM
In 1807, Joseph Fourier described the Fourier series representation of signals, whereby any periodic signal can be represented by the sum of scaled sine and

cosine waveforms. The signal can thus be represented by a sequence of an infinite number of scale factors, or coefficients. Given such a sequence, we can also determine the original waveform. For Fourier series representation of finite sequences, we deal exclusively with the discrete Fourier transform (DFT).

At the core of the computation are the following two equations, which describe the N-point DFT of a data sequence x(k):

X(i) = Σ_{k=0}^{N−1} x(k) W^(ik)        (C.1)

and the inverse DFT of a transform sequence X(i):

x(k) = (1/N) Σ_{i=0}^{N−1} X(i) W^(−ki)        (C.2)

where {x(k)} is a vector of N real samples, {X(i)} is an N-point complex vector with Hermitian symmetry, and the values W = exp[−j(2π/N)] are called the twiddle factors. The number of operations for computing the DFT of a signal using the direct DFT method is proportional to N².

The invention of the FFT is attributed to Cooley and Tukey in 1965. Fast Fourier transforms compute the DFT with a greatly reduced number of operations. By adding certain sequences of data after performing multiplication by the same fixed complex multipliers, the FFT eliminates redundancies in the brute-force DFT computation. This efficiency in computation is achieved at the expense of additional reordering steps to determine the final results. These additional steps, once implemented efficiently, add negligible overhead to the overall compute engine. As a result, the FFT is an extremely robust algorithm that lends itself well to machine computation. Since most applications of the FFT deal with real-time data sequences, full hardware implementation of the algorithm is desired. Due to the complexity and large amount of hardware required for this type of implementation, many trade-offs need to be considered; these are outlined in the upcoming sections.

For large values of N, computational efficiency can be achieved by breaking the DFT into successively smaller calculations. This can be achieved in the time or frequency domain, as discussed in the following sections. The main observation made by Cooley and Tukey was that when N is not prime, the DFT computation can be decomposed into a number of DFTs of smaller length. This decomposition can be continued until the prime radix for the given value of N is reached. For the case where N is a power of 2, the total number of operations is reduced to the order of N log₂ N.
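The reduction from N² to N log₂ N operations can be illustrated with a short sketch. The following is an illustrative reconstruction, not code from the text: a direct DFT (one multiply-add per term of an N-term sum, for each of N outputs) next to a recursive radix-2 decimation-in-time FFT that splits the input into even and odd halves.

```python
import cmath

def dft_direct(x):
    """Direct DFT: on the order of N**2 complex multiply-adds."""
    N = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * i * k / N) for k in range(N))
            for i in range(N)]

def fft_recursive(x):
    """Radix-2 DIT FFT: split into even/odd halves until N == 1,
    giving on the order of N*log2(N) operations."""
    N = len(x)
    if N == 1:
        return x[:]
    even = fft_recursive(x[0::2])
    odd = fft_recursive(x[1::2])
    out = [0j] * N
    for k in range(N // 2):
        t = cmath.exp(-2j * cmath.pi * k / N) * odd[k]  # twiddle times odd branch
        out[k] = even[k] + t
        out[k + N // 2] = even[k] - t
    return out
```

Both routines produce the same transform for any N that is a power of 2; only the operation count differs.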

C.2.1 Radix-2 FFT Computation
In a radix-2 implementation, the basic building block is the two-point butterfly shown in Figure C.1. It has two inputs (x[0], x[1]) and two outputs (X[0], X[1])
Figure C.1 Basic radix-2 butterfly.
and consists of one complex multiply and two complex adds:

X[0] = x[0] + W · x[1]
X[1] = x[0] − W · x[1]        (C.3)

In computing the FFT for sequences larger than two points, data points are successively partitioned according to the methods outlined in Sections C.2.3 and C.2.4 until the basic butterfly, shown in Figure C.1, is reached. Figures C.3 and C.4 depict examples of such partitioning. Radix-2 implementations have minimal demands on memory subsystem sophistication but are not the most efficient for large data sequences.

C.2.2 Radix-4 FFT Computation

In a radix-4 implementation, the basic building block is a four-point butterfly, as depicted in Figure C.2. A basic implementation of this requires four additions and three complex multiplication operations:

X[0] = x[0] + x[1] + x[2] + x[3]
X[1] = (x[0] − j·x[1] − x[2] + j·x[3]) W¹
X[2] = (x[0] − x[1] + x[2] − x[3]) W²
X[3] = (x[0] + j·x[1] − x[2] − j·x[3]) W³        (C.4)

Figure C.2 Basic radix-4 butterfly.
As in the radix-2 computation, larger values of N need to be divided successively, this time into groups of 4, until the basic butterfly is reached.
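The four equations of (C.4) map directly to code. The sketch below is an illustrative helper, not from the text; with unity twiddle factors the butterfly reduces to a plain 4-point DFT, which provides an easy check.

```python
def radix4_butterfly(x0, x1, x2, x3, W1, W2, W3):
    """Four-point butterfly of Eq. (C.4); W1..W3 are the stage twiddle factors."""
    X0 = x0 + x1 + x2 + x3
    X1 = (x0 - 1j * x1 - x2 + 1j * x3) * W1
    X2 = (x0 - x1 + x2 - x3) * W2
    X3 = (x0 + 1j * x1 - x2 - 1j * x3) * W3
    return X0, X1, X2, X3
```

For example, with W1 = W2 = W3 = 1 and inputs (1, 2, 3, 4), the outputs are exactly the 4-point DFT of that sequence.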
C.2.3 Decimation in Time

As mentioned above, it is possible to break the input sequence to be transformed into successively smaller time-domain sequences, hence the name decimation in time (DIT). The term DIT can be used interchangeably with decimation in input for the case of a DFT-only engine. In the more general case, the input can be the data sequence or its respective transform sequence. For the special case of N being a power of 2, the input sequence is divided into its even and odd parts, and each is further divided until the base of the algorithm is reached (two points in a radix-2 algorithm, four points in a radix-4 algorithm, etc.). This requires log_r N stages of computation.¹ The total number of complex multiplies and adds will therefore be N log_r N.

An example of FFT decomposition is depicted in Figure C.3 for N = 8 and a radix of 2. The DIT decomposition is sometimes referred to as the Cooley–Tukey method of computing the FFT.

C.2.4 Decimation in Frequency

When performing pure DFT functionality, the compute engine may be designed to divide the frequency domain (the output in this case) into smaller sequences. The total number of computations remains the same as for DIT-type algorithms: N log_r N, with the number of complex multiplies being (N/radix) log_r N and the number of complex additions being N log_r N.

In general, a great degree of correspondence exists between the DIT and decimation-in-frequency (DIF) algorithms: interchanging one's inputs and outputs and reversing the direction of the flow graph of computation yields the other method. In addition, if the input data are entered in normal order, proper permutations must be applied to produce bit-reversed ordered data before the final results are returned.

An example of FFT decomposition is depicted in Figure C.4 for N = 8 and a radix of 2. The DIF decomposition is sometimes referred to as the Sande–Tukey method of computing the FFT.

C.3 ARCHITECTURAL CONSIDERATIONS

In specifying the architecture and design of an FFT engine, there are several key factors to consider. First is the basic FFT computational building block. At a high level, this affects the radix chosen, along with memory system availability and requirements. At a lower level, it determines the details of how the radix-x processor engine is to be implemented. These are discussed in Section C.3.1.

¹ For convenience, log_r is written for log_radix.
Figure C.3 Flow graph of FFT: decimation in time (DIT) with N = 8.

Another important consideration is memory access. This concerns the number of memory devices to be used and, most important, the addressing scheme used in the implementation of the FFT engine. For space conservation, in-place algorithms are almost universally preferable. In such cases, care must be taken in choosing an appropriate addressing scheme for both the data and the corresponding twiddle factors used in the different stages of the FFT computation, as discussed in Section C.3.2.

Finally, there are details to consider pertaining to the order in which data and twiddle factors enter the computation engine, as well as the scrambling and descrambling of data. Some of these considerations are discussed in Sections C.3.2 and C.3.3.

C.3.1 Number Representation Scheme

One of the most important factors in deciding the architecture of a number-crunching machine is its number representation scheme. This decision directly affects the storage requirements of the machine. In applications such as the FFT or

Figure C.4 Flow graph of FFT: decimation in frequency (DIF) with N = 8.

other signal-processing applications, an increasingly large amount of on-chip memory is required to support the real-time nature of these applications. Saving a few bits for each data point that needs to be stored in the RAMs will save a large percentage of chip area in such applications.

Twos-Complement Versus Sign-Magnitude. The number representation scheme dictates the type of computational elements used in the architecture. If any data representation other than standard twos-complement binary is chosen, one must make a trade-off between taking advantage of the number representation system in the computational elements as well, or using standard twos-complement binary arithmetic elements and making the appropriate conversions between data formats when necessary. More details of this trade-off are covered in Section C.4.

Floating-Point Versus Fixed-Point. The choice of number representation scheme is also driven by the need to achieve maximum "dynamic range" for the available hardware and to minimize error accumulation due to lack of precision. The first floating-point digital filter was built in 1973. Since then, several general-purpose DSP machines have been developed to accommodate high-precision calculations. Floating-point signal processing has several advantages over fixed-point arithmetic, the most notable of which is its superior dynamic range behavior. In fixed-point number representation schemes, the dynamic range grows only linearly (6 dB/bit) with increasing bit width, while for floating-point numbers it grows exponentially with increasing exponent bit width. The use of floating-point number representation improves the dynamic range, and it may be the only scheme that provides a proper signal-to-noise ratio (SNR) in the presence of storage limitations.
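The 6 dB/bit rule and the exponential growth with exponent width can be checked numerically. A small sketch with illustrative helper names:

```python
import math

def fixed_point_range_db(bits):
    """Dynamic range of a bits-wide fixed-point word: ratio of the largest
    magnitude (2**(bits-1) LSBs) to one LSB, in dB (about 6.02 dB per bit)."""
    return 20 * math.log10(2 ** (bits - 1))

def floating_point_range_db(exp_bits):
    """Range spanned by an unsigned exponent of exp_bits bits alone:
    a scale factor of 2**(2**exp_bits - 1)."""
    return 20 * math.log10(2.0 ** (2 ** exp_bits - 1))
```

A 16-bit fixed-point word gives about 90 dB, while a 5-bit exponent by itself already spans about 187 dB: linear growth per mantissa bit versus exponential growth per exponent bit.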

The improvement in dynamic range comes at the expense of increased hardware complexity. In the case of an FFT engine, this hardware overhead is multiplied by the requirement of storing initial and final results as well as intermediate values, all while keeping the proper levels of precision.

Due to the inherent round-off noise in computation-intensive applications, the number representation scheme and the carrying of round-off noise through the computation become essential factors affecting the final performance of the FFT engine. The processor needs to allocate enough internal storage between RAM stages to ensure that the SNR never falls below the minimum system requirements.

In general, the speed of fixed-point multipliers and adders has increased due to technology scaling as well as more efficient algorithms; sub-1-ns 64-bit adders have been reported. Floating-point arithmetic processors have started to enjoy similar performance improvements, even without full-custom implementations. As a result, system SNR requirements and available hardware, rather than delay considerations, are the main factors driving the implementation of the arithmetic units for FFT engines.

C.3.2 Memory Subsystem

One of the more significant considerations in the design of a specialized high-performance DSP engine is its memory subsystem architecture. As the processing engine becomes capable of faster operation and as the sophistication of DSP algorithms increases, there is a greater need for fast, efficient memory-access algorithms, independent of advances in DRAM and SRAM technologies. The goal in most such architectures in general, and in the FFT specifically, is to satisfy the system bandwidth requirements while occupying the least amount of space. In addition, intensive pipelining and the emphasis on high throughput dictate decoupling the fetch hardware from the execute engine. This allows a continuous stream of data to flow into the compute engine, and similarly at the output, for further processing and possibly transmission and reception of data.

In an FFT-specific engine, the basic design of the memory subsystem is influenced by the chosen radix, the number and types of memory elements, and the type of storage. In-place storage is used almost universally to minimize area in applications where it is critical.

FFT Address Storage (FAST). In this section the "FAST" addressing scheme is introduced, which optimizes storage for an in-place FFT algorithm in any radix. Given the number of data points (N) and the chosen radix (r), this scheme determines the optimum storage location for each data point in an r-bank memory system.

Addressing for FFT Modes. The N data points can be organized into r banks for a radix-r implementation, with N/r data points residing in each data bank. Each data point is assigned a bank number (B) and an address (A), which determine its location within each bank. This assignment needs to be done such that the following conditions are met at all times:

1. Only the locations being read in the current FFT cycle are overwritten.
2. The destination locations are determined such that the data points required for all subsequent butterfly stages reside in different memory banks.

The second requirement places more restrictions on the address generation hardware. The following algorithm computes the values of the bank (B) and address (A) from each data point index (i):

1. Express the index in radix-r notation:

   i = i(n−1) r^(n−1) + ··· + i(2) r² + i(1) r + i(0),  where n = log_r N        (C.5)

2. Compute the value of the bank (B) according to the following equation:

   B = [i(n−1) + ··· + i(2) + i(1) + i(0)] modulo r        (C.6)

3. Compute the address location (A) for each data point according to

   A = i modulo (N/r)        (C.7)

These equations yield the optimum address and bank assignments for any radix r, provided that the system is capable of providing r banks of memory per required storage device. This is sufficient to provide contention-free in-place storage assignments for all FFT stages; there are log_r N stages for a radix-r implementation of the FFT. Each stage of the FFT algorithm requires fetching a different set of inputs, which are written back in place once the computation on each data point is completed. This assignment remains unchanged from stage to stage while preserving the contention-free nature of the algorithm. Table C.1 shows the bank and address assignments for the first nine entries of a 256-point FFT implemented in radix-4, using the assignment scheme above.
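Equations (C.6) and (C.7) are easy to express in code. The sketch below is illustrative (the function name is hypothetical); it reproduces the Table C.1 entries and checks that the four operands of each first-stage radix-4 butterfly land in distinct banks.

```python
def fast_assignment(i, N, r):
    """FAST scheme: bank = sum of the radix-r digits of i, modulo r (Eq. C.6);
    address = i modulo (N/r) (Eq. C.7)."""
    digit_sum, t = 0, i
    while t:
        digit_sum += t % r   # accumulate radix-r digits
        t //= r
    return digit_sum % r, i % (N // r)
```

For N = 256 and r = 4 this reproduces Table C.1: index 7 has digits 0013, digit sum 4 mod 4 = 0, giving bank 0 and address 7.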

Addressing for Pre-/Postprocessing. To reduce computational complexity, the same engine can be used to process FFT and IFFT data. Often, system
TABLE C.1 Memory Assignment for a 256-Point FFT in Radix-4

Data-Point    Radix-4 Digits        Memory
Index i       i(3) i(2) i(1) i(0)   Bank/Address

0             0 0 0 0               0/0
1             0 0 0 1               1/1
2             0 0 0 2               2/2
3             0 0 0 3               3/3
4             0 0 1 0               1/4
5             0 0 1 1               2/5
6             0 0 1 2               3/6
7             0 0 1 3               0/7
8             0 0 2 0               2/8

requirements dictate and greatly benefit from this reduced cost. An example of such a system would be a communication device that receives and transmits information in a time-multiplexed fashion. Since the inputs to the FFT and the outputs from the IFFT are real, a significant reduction in computational resources can be achieved. Instead of zeroing out the imaginary part, we can pre/postprocess the data so that an engine half the size of what would otherwise be needed can be used. Preprocessing is an additional stage, in which data are prepared, before entering the IFFT mode of the engine. Similarly, postprocessing is an additional stage after the FFT mode of the engine.

Preprocessing modifies the input vector so that when an IFFT is performed, the real part of the output vector contains the even time samples and the imaginary part of the output contains the odd time samples. For an effective 512-point IFFT function, the preprocessing function is described by the following equation:

Y[i] = {[X(i) + X*(256 − i)] + j[X(i) − X*(256 − i)] W^(256−i)}*        (C.8)

The input to the FFT engine is a 256-point complex vector formed from the real 512-point data vector: the real part of the vector comprises the even sample points, while the odd samples make up the imaginary part. Postprocessing is done as the final stage according to the following equation:


Y[i] = (1/2)[X(i) + X*(256 − i)] − (j/2)[X(i) − X*(256 − i)] W^i        (C.9)
These two modes have special storage and addressing requirements.
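Scaled down to an 8-point example (M = 4 instead of 256 so it runs quickly), the postprocessing of (C.9) can be checked against a direct DFT. This is a sketch under the assumed convention W = exp(−j2πi/2M); the function names are illustrative, not from the text.

```python
import cmath

def dft(x):
    """Direct DFT, used only as a reference here."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def real_fft_via_half_size(x):
    """FFT of a real 2M-point sequence using one M-point complex transform
    plus the postprocessing step of Eq. (C.9)."""
    M = len(x) // 2
    # even samples -> real part, odd samples -> imaginary part
    z = [complex(x[2 * n], x[2 * n + 1]) for n in range(M)]
    Z = dft(z)
    X = []
    for i in range(M):
        Zc = Z[(M - i) % M].conjugate()              # the X*(M - i) term
        W = cmath.exp(-2j * cmath.pi * i / (2 * M))  # assumed twiddle sign
        X.append(0.5 * (Z[i] + Zc) - 0.5j * (Z[i] - Zc) * W)
    return X  # bins 0..M-1; the rest follow from Hermitian symmetry
```

The half-size transform plus one extra pass over the data recovers the first M bins of the full-size real-input FFT exactly.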

As can be observed from (C.8) and (C.9), regardless of the radix used in the computational engine, only two data points are operated on at any instant during pre/postprocessing. This is in line with the radix-2 FFT, where only two inputs are fetched and operated on at any given time. In order to reduce the number of cycles

TABLE C.2 Data Ordering/Digit Reversal in Radix-2 and Radix-4 FFT

Data    Radix-2    Radix-2    Radix-4    Radix-4
Point   Input      Output     Input      Output

0       0000       0000       00         00
1       0001       1000       01         10
2       0010       0100       02         20
3       0011       1100       03         30
4       0100       0010       10         01
5       0101       1010       11         11
6       0110       0110       12         21
7       0111       1110       13         31

required to perform these two operations, and to take full advantage of the existing hardware in non-radix-2 FFT architectures, the addressing scheme is modified to fetch and operate on r data points at a time. For a radix that is a multiple of 2, such as radix-4, this translates into keeping a dual set of symmetric addresses from the top and bottom and fetching four points at a time. The corresponding table is omitted because its derivation is simple.

C.3.3 Scrambling and Unscrambling of Data

Due to the inherent nature of the FFT algorithm, and depending on whether DIT or DIF is chosen, the input data points or the outputs will be in bit-reversed order in a radix-2 computation. As a result, coefficient entry into the FFT engine needs to be adjusted accordingly. Bit reversal in the radix-2 case generalizes to "digit reversal" in radix-r architectures. Many attempts have been made to study the feasibility of implementing this type of address generation in software, but due to the long delays of software implementations, the consensus has been to add the hardware to the system. The hardware implementation can be greatly simplified if the reordering of data is properly designed into the memory-access blocks. An example of the duality between radix-2 and radix-4 data input and output orderings is given in Table C.2.
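Digit reversal for any radix is a short loop. A sketch (the helper name is hypothetical) that reproduces the Table C.2 orderings:

```python
def digit_reverse(i, radix, ndigits):
    """Reverse the base-`radix` digits of index i; with radix == 2 this is
    ordinary bit reversal."""
    out = 0
    for _ in range(ndigits):
        i, d = divmod(i, radix)   # peel off the lowest digit
        out = out * radix + d     # push it onto the high end
    return out
```

For example, digit_reverse(3, 2, 4) maps 0011 to 1100 (decimal 12), and digit_reverse(6, 4, 2) maps radix-4 digits 12 to 21 (decimal 9), matching Table C.2.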

C.3.4 Twiddle Factor Generation

One of the hardware implementation considerations for an FFT engine is twiddle factor generation. Hardware requirements can be greatly reduced by exploiting the symmetry properties of the twiddle factors. As depicted in Figure C.5, only the values in region H need be stored; the rest of the required twiddle factors can be generated simply by inversion and transposition of the stored values. Table C.3 shows the twiddle factors for each point on the plane based on the stored real (R) and imaginary (I) values of the corresponding point in region H. Hence the minimum amount of read-only memory required for twiddle factor storage is reduced to N/8 entries.

Figure C.5 Twiddle factors for N-Point FFT in the complex plane; only region marked as “H” is stored in ROM.

TABLE C.3 Twiddle Factor Generation

Region    Real[Twiddle Factor]    Imaginary[Twiddle Factor]

A         R                       −I
B         −I                      R
C         I                       R
D         −R                      −I
E         −R                      I
F         I                       −R
G         −I                      −R
H         R                       I

The points in this two-dimensional complex space depict values of W in the counterclockwise direction, where

W = e^(j2πk/N)  for k = 0, ..., N − 1        (C.10)
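One way to realize this octant-symmetry idea in software is sketched below (an illustrative model, not the book's hardware: the octant bookkeeping is made explicit, the stored table holds k = 0 .. N/8 inclusive, and W follows the convention of (C.10)).

```python
import cmath
import math

def build_octant_table(N):
    """Region H only: (cos, sin) pairs for k = 0 .. N/8 inclusive."""
    return [(math.cos(2 * math.pi * k / N), math.sin(2 * math.pi * k / N))
            for k in range(N // 8 + 1)]

# (real, imag) of W**k in each octant, formed from the stored pair (c, s)
# in region H by sign inversion and/or swapping (transposition)
OCTANT_MAP = [
    lambda c, s: (c, s),
    lambda c, s: (s, c),
    lambda c, s: (-s, c),
    lambda c, s: (-c, s),
    lambda c, s: (-c, -s),
    lambda c, s: (-s, -c),
    lambda c, s: (s, -c),
    lambda c, s: (c, -s),
]

def twiddle(k, N, table):
    """W**k for any k, using only the stored first-octant values."""
    k %= N
    octant, m = divmod(k, N // 8)
    idx = m if octant % 2 == 0 else N // 8 - m  # odd octants read H backward
    c, s = table[idx]
    re, im = OCTANT_MAP[octant](c, s)
    return complex(re, im)
```

Every twiddle factor is thus reconstructed from at most one table read plus sign flips and a swap, which is what makes the N/8 ROM size practical.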
C.4 A REPRESENTATIVE FFT ENGINE IMPLEMENTATION

In this section, some of the design considerations for a representative FFT/IFFT engine are presented. The architecture is applicable to many different applications, including high-speed digital modems such as xDSL. In this implementation, the engine performs a 512-point real-to-complex fast Fourier transform in the FFT mode, and in the IFFT mode it performs a 512-point complex-to-real inverse Fourier transform.

C.4.1 Data Format
The data format should achieve the following objectives:

1. Maximize the dynamic range of the represented numbers.
2. Minimize the representation error (maximize precision).
3. Minimize the storage area.
4. Facilitate FFT computations.

Figure C.6 Data format bit allocation.

Signed-magnitude number representation was chosen to facilitate some of the FFT computational steps, and floating-point number representation was chosen for the highest dynamic range and reduced storage area. Each data point is represented as a complex, sign-magnitude, floating-point number with one unsigned (implicitly negative) exponent that, in order to save memory space, is shared between the real and imaginary parts.² The mantissa is a fully fractional quantity, and the whole complex number is always less than 1, with 1 being used to indicate overflow conditions. That is,

DP = (real + j · imaginary) × 2^(−exp)        (C.11)

Figure C.6 shows the data format, comprising u real bits, u imaginary bits, v exponent bits, and one sign bit. The dynamic range of the exponent is L = 2^v, so that

Magnitude of smallest represented number (precision):   2^(−(L+u))
Magnitude of largest represented number:                2^0
Range of biased exponent:                               [0, 2^v − 1]
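A software model of this format can make the shared-exponent behavior concrete. This is a sketch only: U and V stand in for the u and v of Figure C.6, their values are hypothetical, and the field ordering is illustrative rather than the actual bit layout.

```python
U, V = 8, 3   # hypothetical mantissa and exponent widths

def _sign_mag(x, exp):
    """Sign-magnitude mantissa of x * 2**exp on U fractional bits (truncating;
    values at or above 1 clip to the maximum mantissa)."""
    mag = min(int(abs(x) * 2 ** (exp + U)), 2 ** U - 1)
    return (1 if x < 0 else 0), mag

def pack(z):
    """One shared, implicitly negative exponent, normalized to the larger
    of the real and imaginary magnitudes."""
    m = max(abs(z.real), abs(z.imag))
    exp = 0
    while m and m * 2 ** (exp + 1) < 1 and exp < 2 ** V - 1:
        exp += 1
    return _sign_mag(z.real, exp) + _sign_mag(z.imag, exp) + (exp,)

def unpack(sr, mr, si, mi, exp):
    """Eq. (C.11): DP = (real + j*imaginary) * 2**(-exp)."""
    re = (-1) ** sr * mr / 2 ** U
    im = (-1) ** si * mi / 2 ** U
    return complex(re, im) * 2.0 ** -exp
```

Sharing one exponent between the two components costs a little precision on the smaller component but halves the exponent storage, which is exactly the trade described above.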

