A thesis submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Electrical and Electronic Engineering
The demand for high-speed and energy-efficient arithmetic circuits is rapidly growing in modern digital systems, particularly in domains such as image processing, machine learning and signal processing, where exact computation is not always essential. This thesis presents the design and implementation of an approximate 8×8 signed multiplier using FPGA technology, aiming to strike an optimal balance between accuracy, performance and hardware efficiency.
The proposed design features a three-stage architecture. The first stage employs Radix-4 Booth encoding to reduce the number of partial product rows from eight to four, lowering hardware complexity and improving speed. The second stage uses the FPGA's carry chain to compress the top two rows into one, leaving three partial product rows. In the final stage, the FPGA's LUT6_2 primitives are used column-wise to compute the final product directly, eliminating the need for a traditional multi-level adder tree.
The design is described in VHDL and synthesized on an FPGA platform. Simulation and synthesis results reveal significant improvements in area utilization and propagation delay, with acceptable accuracy for approximate computing applications. Notably, the proposed multiplier achieves a maximum error of 3 and an average error of approximately 1.5, making it well-suited for error-tolerant applications in resource-constrained environments.
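For readers who want to reproduce error figures of this kind, the following is a minimal sketch of an exhaustive error evaluation over all 65,536 signed 8-bit operand pairs. It assumes that "error" means the absolute difference between the approximate and exact products. The function `truncated_mul` is a hypothetical stand-in that simply clears the two least significant bits of the exact product; it is not the multiplier proposed in this thesis and is included only so the harness has something to measure.

```python
from typing import Callable, Tuple


def exact_mul(a: int, b: int) -> int:
    """Reference product of two signed integers."""
    return a * b


def truncated_mul(a: int, b: int, dropped_bits: int = 2) -> int:
    """Hypothetical approximate multiplier (NOT the thesis design):
    clears the lowest `dropped_bits` bits of the exact product."""
    p = a * b
    return p - (p % (1 << dropped_bits))


def error_stats(approx: Callable[[int, int], int], width: int = 8) -> Tuple[int, float]:
    """Return (maximum error, mean error) over every signed operand pair.

    Error is taken as |approx(a, b) - a * b|, which is an assumption
    about the metric rather than something stated in the text.
    """
    lo, hi = -(1 << (width - 1)), 1 << (width - 1)
    max_err = 0
    total = 0
    count = 0
    for a in range(lo, hi):
        for b in range(lo, hi):
            err = abs(approx(a, b) - exact_mul(a, b))
            max_err = max(max_err, err)
            total += err
            count += 1
    return max_err, total / count


if __name__ == "__main__":
    max_err, mean_err = error_stats(truncated_mul)
    print(f"max error = {max_err}, mean error = {mean_err:.3f}")
```

Any bit-accurate software model of an approximate multiplier can be passed to `error_stats` in place of the stand-in.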
Overview of multiplication in digital systems, motivation for approximate multipliers, importance of signed multiplication, and role of FPGA in arithmetic design.
Review of previous work on exact and approximate multiplier design, signed arithmetic architectures and FPGA-based optimization.
Details of modern FPGA architectures, configurable logic blocks, look-up tables (LUTs), CARRY4, and synthesis techniques.
Design flow overview, partial product generation using Radix-4 encoding, carry chain-based compression, and LUT6_2-based final result calculation.
Hardware performance results, error analysis, Vivado simulation results, and analysis of accuracy trade-off.
Summary of findings, conclusions, and future research directions for extending the design.
Lowest LUT usage among the compared designs (LUT counts of existing designs: Nagar 74, Ullah 54, Rehman 94)
Competitive delay performance while maintaining low resource utilization
Bounded and predictable error profile suitable for error-tolerant applications
Low average error making it suitable for image processing and ML applications
Reduces partial product rows using Radix-4 Booth encoding with Baugh-Wooley's algorithm
Uses FPGA-native CARRY4 to compress partial product rows
Employs LUT6_2 primitives for column-wise final product computation with approximation in LSBs
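The first of these steps, reducing eight partial product rows to four, can be illustrated with a small behavioural model. The sketch below shows textbook Radix-4 Booth recoding of a signed 8-bit multiplier in Python; it is an arithmetic illustration only, not the VHDL of this design, and it models neither the Baugh-Wooley sign handling, the CARRY4 compression, nor the LUT6_2 approximation. The function names `booth_radix4_digits` and `partial_products` are hypothetical.

```python
from typing import List


def booth_radix4_digits(b: int, width: int = 8) -> List[int]:
    """Recode a signed `width`-bit multiplier into width/2 Booth digits,
    each in {-2, -1, 0, 1, 2}, least significant digit first."""
    if width % 2 != 0:
        raise ValueError("width must be even for Radix-4 recoding")
    if not -(1 << (width - 1)) <= b < (1 << (width - 1)):
        raise ValueError("multiplier does not fit in the given width")
    pattern = b & ((1 << width) - 1)  # two's complement bit pattern
    extended = pattern << 1           # append the implicit bit b[-1] = 0
    digits = []
    for i in range(width // 2):
        triplet = (extended >> (2 * i)) & 0b111  # b[2i+1], b[2i], b[2i-1]
        b_prev = triplet & 1
        b_mid = (triplet >> 1) & 1
        b_next = (triplet >> 2) & 1
        digits.append(b_prev + b_mid - 2 * b_next)
    return digits


def partial_products(a: int, b: int, width: int = 8) -> List[int]:
    """One partial product row per Booth digit, weighted by 4**i."""
    return [a * d * (4 ** i) for i, d in enumerate(booth_radix4_digits(b, width))]


if __name__ == "__main__":
    print(booth_radix4_digits(7))    # [-1, 2, 0, 0], since -1 + 2*4 = 7
    rows = partial_products(-100, 7)
    print(len(rows), sum(rows))      # 4 rows that sum to -700
```

Summing the four rows reproduces the exact product, which is why any approximation in the design has to come from the later compression and final-addition stages rather than from the recoding itself.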