Rules for Arithmetic Operations

Fixed-point arithmetic refers to how signed or unsigned binary words are operated on. The simplicity of fixed-point arithmetic functions such as addition and subtraction allows for cost-effective hardware implementations.

The sections that follow describe the rules that the Simulink® software follows when arithmetic operations are performed on inputs and parameters. These rules are organized into four groups based on the operations involved: addition and subtraction, multiplication, division, and shifts. For each of these four groups, the rules for performing the specified operation are presented with an example using the rules.

Computational Units

The core architecture of many processors contains several computational units including arithmetic logic units (ALUs), multiply and accumulate units (MACs), and shifters. These computational units process the binary data directly and provide support for arithmetic computations of varying precision. The ALU performs a standard set of arithmetic and logic operations as well as division. The MAC performs multiply, multiply/add, and multiply/subtract operations. The shifter performs logical and arithmetic shifts, normalization, denormalization, and other operations.

Addition and Subtraction

Addition is the most common arithmetic operation a processor performs. When two n-bit numbers are added together, it is always possible to produce a result with n + 1 nonzero digits due to a carry from the leftmost digit. For two's complement addition of two numbers, there are three cases to consider:

If both numbers are positive and the result of their addition has a sign bit of 1, then overflow has occurred; otherwise, the result is correct.
If both numbers are negative and the sign of the result is 0, then overflow has occurred; otherwise, the result is correct.
If the numbers are of unlike sign, overflow cannot occur and the result is always correct.
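
The three cases can be checked mechanically. The following is a minimal Python sketch, not the Simulink implementation; the function name `add_twos_complement` and the 8-bit default word length are assumptions chosen for illustration.

```python
def add_twos_complement(a, b, bits=8):
    """Add two signed integers in a `bits`-wide two's complement word.

    Returns (wrapped_result, overflow_flag).
    """
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    raw = (a + b) & mask                                 # keep only the low `bits` bits
    result = raw - (1 << bits) if raw & sign else raw    # reinterpret as signed
    same_sign_inputs = (a < 0) == (b < 0)
    # Overflow is only possible when the inputs share a sign
    # and the sign of the wrapped result differs from it.
    overflow = same_sign_inputs and ((result < 0) != (a < 0))
    return result, overflow


print(add_twos_complement(100, 50))    # both positive, sign bit of result is 1
print(add_twos_complement(-100, -50))  # both negative, sign bit of result is 0
print(add_twos_complement(100, -50))   # unlike signs, no overflow possible
```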

Fixed-Point Simulink Blocks Summation Process

Consider the summation of two numbers. Ideally, the real-world values obey the equation

$V_{a} = \pm V_{b} \pm V_{c},$

where V_b and V_c are the input values and V_a is the output value. To see how the summation is actually implemented, the three ideal values should be replaced by the general [Slope Bias] encoding scheme described in Scaling, Range, and Precision:

$V_{i} = F_{i} 2^{E_{i}} Q_{i} + B_{i} .$

The equation in Addition gives the solution of the resulting equation for the stored integer, Q_a. Using shorthand notation, that equation becomes

$Q_{a} = \pm F_{sb} 2^{E_{b} - E_{a}} Q_{b} \pm F_{sc} 2^{E_{c} - E_{a}} Q_{c} + B_{net},$

where F_sb and F_sc are the adjusted fractional slopes and B_net is the net bias. The offline conversions and online conversions and operations are discussed below.
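
For reference, substituting the [Slope Bias] encoding into V_a = V_b + V_c (taking both signs as positive for brevity) and solving for Q_a suggests the following definitions of the shorthand terms. They are inferred from the algebra rather than quoted from this section:

$F_{sb} = \frac{F_{b}}{F_{a}}, \quad F_{sc} = \frac{F_{c}}{F_{a}}, \quad B_{net} = \frac{B_{b} + B_{c} - B_{a}}{F_{a}} 2^{- E_{a}} .$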

Offline Conversions. F_sb, F_sc, and B_net are computed offline using round-to-nearest and saturation. Furthermore, B_net is stored using the output data type.

Online Conversions and Operations. The remaining operations are performed online by the fixed-point processor, and depend on the slopes and biases for the input and output data types. The worst (most inefficient) case occurs when the slopes and biases are mismatched. The worst-case conversions and operations are given by these steps:

1. The initial value for Q_a is given by the net bias, B_net:
$Q_{a} = B_{net} .$
2. The first input integer value, Q_b, is multiplied by the adjusted slope, F_sb:
$Q_{RawProduct} = F_{sb} Q_{b} .$
3. The previous product is converted to the modified output data type where the slope is one and the bias is zero:
$Q_{Temp} = convert (Q_{RawProduct}) .$
This conversion includes any necessary bit shifting, rounding, or overflow handling.
4. The summation operation is performed:
$Q_{a} = Q_{a} \pm Q_{Temp} .$
This summation includes any necessary overflow handling.
5. Steps 2 to 4 are repeated for every number to be summed.

It is important to note that bit shifting, rounding, and overflow handling are applied to the intermediate steps (3 and 4) and not to the overall sum.
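
The worst-case steps above can be sketched in Python. This is an illustration under stated assumptions, not the Simulink implementation: the names `saturate` and `slope_bias_sum` are hypothetical, the adjusted slopes are taken to be F_b/F_a and F_c/F_a, the online conversion rounds toward floor, and a signed 16-bit output is assumed. Exact rationals (`fractions.Fraction`) stand in for the stored slope constants.

```python
from fractions import Fraction
import math


def saturate(q, bits=16):
    """Clamp q to the range of a signed `bits`-wide integer."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, q))


def slope_bias_sum(inputs, out, bits=16):
    """inputs: list of (sign, F, E, B, Q) tuples; out: (F_a, E_a, B_a)."""
    f_a, e_a, b_a = Fraction(out[0]), out[1], Fraction(out[2])
    # Offline: net bias in output stored-integer units, round-to-nearest, saturated.
    b_sum = sum(sign * Fraction(b) for sign, _, _, b, _ in inputs)
    b_net = (b_sum - b_a) / (f_a * Fraction(2) ** e_a)
    q_a = saturate(math.floor(b_net + Fraction(1, 2)), bits)    # step 1
    for sign, f, e, _, q in inputs:
        f_s = Fraction(f) / f_a                                  # adjusted slope (offline)
        raw = f_s * q                                            # step 2
        scaled = raw * Fraction(2) ** (e - e_a)                  # align binary points
        temp = saturate(math.floor(scaled), bits)                # step 3 (floor assumed)
        q_a = saturate(q_a + sign * temp, bits)                  # step 4
    return q_a


# V_b = 1.0 (F=1, E=-4, B=0, Q=16); V_c = 0.96875 (F=1.5, E=-5, B=0.5, Q=10)
q_out = slope_bias_sum([(+1, 1, -4, 0, 16), (+1, 1.5, -5, 0.5, 10)], (1, -4, 0))
print(q_out, q_out / 16)
```

The ideal sum is 1.96875, which corresponds to a stored integer of 31.5. Because rounding is applied to the intermediate product in step 3 rather than to the overall sum, the sketch returns 31 (real-world value 1.9375).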

For more information, see The Summation Process.

Streamlining Simulations and Generated Code

If the scaling of the input and output signals is matched, the number of summation operations is reduced from the worst (most inefficient) case. For example, when an input has the same fractional slope as the output, step 2 reduces to multiplication by one and can be eliminated. Trivial steps in the summation process are eliminated for both simulation and code generation. Exclusive use of binary-point-only scaling for both input signals and output signals is a common way to eliminate mismatched slopes and biases, and results in the most efficient simulations and generated code.

Multiplication

The multiplication of an n-bit binary number with an m-bit binary number results in a product that is up to m + n bits in length for both signed and unsigned words. Most processors perform n-bit by n-bit multiplication and produce a 2n-bit result (double bits) assuming there is no overflow condition.
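
As a quick check of the m + n bound, the following Python sketch computes the word length needed for the worst-case product of a 4-bit and a 6-bit operand. The helper name `bits_needed` and the operand sizes are illustrative assumptions.

```python
def bits_needed(value, signed):
    """Smallest word length that can hold `value`."""
    if not signed:
        return max(1, value.bit_length())
    # Two's complement: magnitude bits plus one sign bit.
    magnitude = value.bit_length() if value >= 0 else (~value).bit_length()
    return magnitude + 1


n, m = 4, 6
unsigned_worst = ((1 << n) - 1) * ((1 << m) - 1)          # 15 * 63
signed_worst = (-(1 << (n - 1))) * (-(1 << (m - 1)))      # (-8) * (-32)

print(bits_needed(unsigned_worst, signed=False))
print(bits_needed(signed_worst, signed=True))
```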

Fixed-Point Simulink Blocks Multiplication Process

Consider the multiplication of two numbers. Ideally, the real-world values obey the equation

$V_{a} = V_{b} V_{c},$

where V_b and V_c are the input values and V_a is the output value. To see how the multiplication is actually implemented, the three ideal values should be replaced by the general [Slope Bias] encoding scheme described in Scaling, Range, and Precision:

$V_{i} = F_{i} 2^{E_{i}} Q_{i} + B_{i} .$

The solution of the resulting equation for the output stored integer, Q_a, is given below:

$\begin{matrix} Q_{a} = \frac{F_{b} F_{c}}{F_{a}} 2^{E_{b} + E_{c} - E_{a}} Q_{b} Q_{c} + \frac{F_{b} B_{c}}{F_{a}} 2^{E_{b} - E_{a}} Q_{b} \\ + \frac{F_{c} B_{b}}{F_{a}} 2^{E_{c} - E_{a}} Q_{c} + \frac{B_{b} B_{c} - B_{a}}{F_{a}} 2^{- E_{a}} . \end{matrix}$
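
To see where the four terms come from, expand the product of the two encoded inputs:

$F_{a} 2^{E_{a}} Q_{a} + B_{a} = F_{b} F_{c} 2^{E_{b} + E_{c}} Q_{b} Q_{c} + F_{b} B_{c} 2^{E_{b}} Q_{b} + F_{c} B_{b} 2^{E_{c}} Q_{c} + B_{b} B_{c} .$

Subtracting B_a from both sides and dividing by F_a 2^{E_a} gives the expression for Q_a above.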

Multiplication with Nonzero Biases and Mismatched Fractional Slopes. The worst-case implementation of the above equation occurs when the slopes and biases of the input and output signals are mismatched. In such cases, several low-level integer operations are required to carry out the high-level multiplication (or division). Implementation choices made about these low-level computations can affect the computational efficiency, rounding errors, and overflow.

In Simulink blocks, the actual multiplication or division operation is always performed on fixed-point variables that have zero biases. If an input has nonzero bias, it is converted to a representation that has binary-point-only scaling before the operation. If the result is to have nonzero bias, the operation is first performed with temporary variables that have binary-point-only scaling. The result is then converted to the data type and scaling of the final output.

If both the inputs and the output have nonzero biases, then the operation is broken down as follows:

$\begin{array}{l} V_{1Temp} = V_{1}, \\ V_{2Temp} = V_{2}, \\ V_{3Temp} = V_{1Temp} V_{2Temp}, \\ V_{3} = V_{3Temp}, \end{array}$

where

$\begin{array}{l} V_{1Temp} = 2^{E_{1Temp}} Q_{1Temp}, \\ V_{2Temp} = 2^{E_{2Temp}} Q_{2Temp}, \\ V_{3Temp} = 2^{E_{3Temp}} Q_{3Temp} . \end{array}$

These equations show that the temporary variables have binary-point-only scaling. However, the equations do not indicate the signedness, word lengths, or values of the fixed exponent of these variables. The Simulink software assigns these properties to the temporary variables based on the following goals:

Represent the original value without overflow.
The data type and scaling of the original value define a maximum and minimum real-world value:

$V_{Max} = F 2^{E} Q_{MaxInteger} + B,$

$V_{Min} = F 2^{E} Q_{MinInteger} + B .$
The data type and scaling of the temporary value must be able to represent this range without overflow. Precision loss is possible, but overflow is never allowed.
Use a data type that leads to efficient operations.
This goal is relative to the target that you will use for production deployment of your design. For example, suppose that you will implement the design on a 16-bit fixed-point processor that provides a 32-bit long, 16-bit int, and 8-bit short or char. For such a target, preserving efficiency means that no more than 32 bits are used, and the smaller sizes of 8 or 16 bits are used if they are sufficient to maintain precision.
Maintain precision.
Ideally, every possible value defined by the original data type and scaling is represented perfectly by the temporary variable. However, this can require more bits than is efficient. Bits are discarded, resulting in a loss of precision, to the extent required to preserve efficiency.

For example, consider the following, assuming a 16-bit microprocessor target:

$V_{Original} = Q_{Original} - 43.25,$

where Q_Original is an 8-bit, unsigned data type. For this data type,

$\begin{matrix} Q_{MaxInteger} = 255, \\ Q_{MinInteger} = 0, \end{matrix}$

$\begin{matrix} V_{Max} = 211.75, \\ V_{Min} = - 43.25. \end{matrix}$

The minimum possible value is negative, so the temporary variable must be a signed integer data type. The original variable has a slope of 1, but the bias is expressed with greater precision with two digits after the binary point. To get full precision, the fixed exponent of the temporary variable has to be -2 or less. The Simulink software selects the least possible precision, which is generally the most efficient, unless overflow issues arise. For a scaling of 2^-2, selecting signed 16-bit or signed 32-bit avoids overflow. For efficiency, the Simulink software selects the smaller choice of 16 bits. If the original variable is an input, then the equations to convert to the temporary variable are

$\begin{array}{l} \begin{matrix} \mathtt{uint8\_T} & Q_{Original}, \\ \mathtt{int16\_T} & Q_{Temp}, \end{matrix} \\ Q_{Temp} = ((\mathtt{int16\_T}) Q_{Original} \ll 2) - 173. \end{array}$
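
The conversion can be verified exhaustively over all 8-bit unsigned inputs. The Python sketch below is an illustration only; the names `to_temp`, `BIAS`, and `q_temps` are hypothetical. It confirms that a fixed exponent of -2 represents every original value exactly (the constant 173 is 43.25 × 4) and that the temporary stored integers need a signed 16-bit word.

```python
from fractions import Fraction

BIAS = Fraction(-173, 4)   # -43.25, the bias of the original variable


def to_temp(q_original):
    """Convert the uint8 stored integer to the signed temporary stored integer."""
    return (q_original << 2) - 173


q_temps = [to_temp(q) for q in range(256)]

# Each temporary value 2^-2 * Q_Temp equals the original real-world value.
exact = all(Fraction(qt, 4) == q + BIAS for q, qt in zip(range(256), q_temps))
fits_int16 = -32768 <= min(q_temps) and max(q_temps) <= 32767
fits_int8 = -128 <= min(q_temps) and max(q_temps) <= 127

print(min(q_temps), max(q_temps), exact, fits_int16, fits_int8)
```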

Multiplication with Zero Biases and Mismatched Fractional Slopes. When the biases are zero and the fractional slopes are mismatched, the implementation reduces to

$Q_{a} = \frac{F_{b} F_{c}}{F_{a}} 2^{E_{b} + E_{c} - E_{a}} Q_{b} Q_{c} .$

Offline Conversions

The quantity

$F_{Net} = \frac{F_{b} F_{c}}{F_{a}}$

is calculated offline using round-to-nearest and saturation. F_Net is stored using a fixed-point data type of the form

$2^{E_{Net}} Q_{Net},$

where E_Net and Q_Net are selected automatically to best represent F_Net.

Online Conversions and Operations

1. The integer values Q_b and Q_c are multiplied:
$Q_{RawProduct} = Q_{b} Q_{c} .$
To maintain the full precision of the product, the binary point of Q_RawProduct is given by the sum of the binary points of Q_b and Q_c.
2. The previous product is converted to the output data type:
$Q_{Temp} = convert (Q_{RawProduct}) .$
This conversion includes any necessary bit shifting, rounding, or overflow handling. Signal Conversions discusses conversions.
3. The multiplication
$Q_{2RawProduct} = Q_{Temp} Q_{Net}$
is performed.
4. The previous product is converted to the output data type:
$Q_{a} = convert (Q_{2RawProduct}) .$
This conversion includes any necessary bit shifting, rounding, or overflow handling. Signal Conversions discusses conversions.
5. Steps 1 through 4 are repeated for each additional number to be multiplied.
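
A minimal Python sketch of these steps follows. It is not the Simulink implementation: the helper names `shift`, `quantize_slope`, and `multiply_mismatched` are hypothetical, the net slope is assumed to be stored in a signed 16-bit word with the most precise exponent that fits, conversions round toward floor, and overflow handling is omitted for brevity.

```python
from fractions import Fraction


def shift(q, k):
    """Shift left for k >= 0, otherwise arithmetic shift right (rounds toward floor)."""
    return q << k if k >= 0 else q >> -k


def quantize_slope(f_net, bits=16):
    """Offline step (assumed policy): pick E_net, Q_net with f_net ~= 2**E_net * Q_net.

    Assumes 0 < f_net < 2**(bits - 1).
    """
    hi = (1 << (bits - 1)) - 1
    e_net = 0
    while f_net * 2 ** (1 - e_net) <= hi:    # one more fractional bit still fits
        e_net -= 1
    return e_net, round(f_net * 2 ** -e_net)


def multiply_mismatched(q_b, q_c, slopes, exps):
    f_b, f_c, f_a = (Fraction(f) for f in slopes)
    e_b, e_c, e_a = exps
    e_net, q_net = quantize_slope(f_b * f_c / f_a)    # offline
    raw = q_b * q_c                                    # step 1
    temp = shift(raw, e_b + e_c - e_a)                 # step 2
    raw2 = temp * q_net                                # step 3
    return shift(raw2, e_net)                          # step 4


# V_b = 1.5 * 2^-4 * 100 = 9.375, V_c = 2^-4 * 32 = 2.0, output scaling 2^-4
q_a = multiply_mismatched(100, 32, slopes=(1.5, 1, 1), exps=(-4, -4, -4))
print(q_a, q_a / 16)
```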

Multiplication with Zero Biases and Matching Fractional Slopes. When the biases are zero and the fractional slopes match, the implementation reduces to

$Q_{a} = 2^{E_{b} + E_{c} - E_{a}} Q_{b} Q_{c} .$

Offline Conversions

No offline conversions are performed.

Online Conversions and Operations

1. The integer values Q_b and Q_c are multiplied:
$Q_{RawProduct} = Q_{b} Q_{c} .$
To maintain the full precision of the product, the binary point of Q_RawProduct is given by the sum of the binary points of Q_b and Q_c.
2. The previous product is converted to the output data type:
$Q_{a} = convert (Q_{RawProduct}) .$
This conversion includes any necessary bit shifting, rounding, or overflow handling. Signal Conversions discusses conversions.
3. Steps 1 and 2 are repeated for each additional number to be multiplied.
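
With matching slopes the process reduces to one integer multiply and one shift. The Python sketch below assumes a signed 16-bit output, rounding toward floor on right shifts, and saturation on overflow; the function name `multiply_binary_point` is hypothetical.

```python
def multiply_binary_point(q_b, q_c, e_b, e_c, e_a, bits=16):
    """Multiply two binary-point-only values and convert to the output scaling."""
    raw = q_b * q_c                       # step 1: full-precision product
    k = e_b + e_c - e_a                   # net shift from the exponents
    q_a = raw << k if k >= 0 else raw >> -k
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, q_a))          # step 2: saturate to the output word


# 1.5 * 2.5 with eight fractional bits everywhere: 384 * 640 -> 960 (3.75)
q_a = multiply_binary_point(384, 640, -8, -8, -8)
print(q_a, q_a / 256)
```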

For more information, see The Multiplication Process.

Division

This section discusses the division of quantities with zero bias.

Note

When any input to a division calculation has nonzero bias, the operations performed exactly match those for multiplication described in Multiplication with Nonzero Biases and Mismatched Fractional Slopes.

Fixed-Point Simulink Blocks Division Process

Consider the division of two numbers. Ideally, the real-world values obey the equation

$V_{a} = V_{b} / V_{c},$

where V_b and V_c are the input values and V_a is the output value. To see how the division is actually implemented, the three ideal values should be replaced by the general [Slope Bias] encoding scheme described in Scaling, Range, and Precision:

$V_{i} = F_{i} 2^{E_{i}} Q_{i} + B_{i} .$

For the case where the slope adjustment factors are one and the biases are zero for all signals, the solution of the resulting equation for the output stored integer, Q_a, is given by the following equation:

$Q_{a} = 2^{E_{b} - E_{c} - E_{a}} (Q_{b} / Q_{c}) .$

This equation involves an integer division and some bit shifts. If E_a > E_b − E_c, then any bit shifts are to the right and the implementation is simple. However, if E_a < E_b − E_c, then the bit shifts are to the left and the implementation can be more complicated. The essential issue is that the output has more precision than the integer division provides. To get full precision, a fractional division is needed. The C programming language provides access to integer division only for fixed-point data types. Depending on the size of the numerator, you can obtain some of the fractional bits by performing a shift prior to the integer division. In the worst case, it might be necessary to resort to repeated subtractions in software.

In general, division of values is an operation that should be avoided in fixed-point embedded systems. Division where the output has more precision than the integer division (i.e., E_a < E_b − E_c) should be used with even greater reluctance.
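
The effect of shifting before the integer division can be illustrated with a short Python sketch. The function names are hypothetical, and k stands for E_b − E_c − E_a, assumed positive here. Python integers do not overflow, so the sketch does not show the headroom limit on the numerator that a fixed word length imposes; Python's `//` also floors, whereas C division truncates toward zero, which matters only for negative operands.

```python
def divide_then_shift(q_b, q_c, k):
    """Integer division first: the fractional bits are already lost."""
    return (q_b // q_c) << k


def shift_then_divide(q_b, q_c, k):
    """Shift the numerator first: up to k fractional bits are recovered."""
    return (q_b << k) // q_c


# V_b = 1.0 (Q_b = 16, E_b = -4), V_c = 3.0 (Q_c = 48, E_c = -4), E_a = -8
k = (-4) - (-4) - (-8)
print(divide_then_shift(16, 48, k))        # stored integer 0
print(shift_then_divide(16, 48, k))        # stored integer 85
print(shift_then_divide(16, 48, k) / 256)  # real-world value, close to 1/3
```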

For more information, see The Division Process.

Shifts

Nearly all microprocessors and digital signal processors support well-defined bit-shift (or simply shift) operations for integers. For example, consider the 8-bit unsigned integer 00110101. The results of a 2-bit shift to the left and a 2-bit shift to the right are shown in the following table.

Shift Operation	Binary Value	Decimal Value
No shift (original number)	00110101	53
Shift left by 2 bits	11010100	212
Shift right by 2 bits	00001101	13
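
The table entries can be reproduced in Python. Because Python integers are unbounded, the left shift is masked to 8 bits to mimic the word length; the variable names are illustrative.

```python
x = 0b00110101                 # 53

left = (x << 2) & 0xFF         # discard bits shifted out of the 8-bit word
right = x >> 2                 # zeros enter from the left for an unsigned value

print(format(left, "08b"), left)
print(format(right, "08b"), right)
```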

You can perform a shift using the Simulink Shift Arithmetic block. Use this block to perform a bit shift, a binary point shift, or both.

Shifting Bits to the Right

The special case of shifting bits to the right requires consideration of the treatment of the leftmost bit, which can contain sign information. A shift to the right can be classified either as a logical shift right or an arithmetic shift right. For a logical shift right, a 0 is incorporated into the most significant bit for each bit shift. For an arithmetic shift right, the most significant bit is recycled for each bit shift.

The Shift Arithmetic block performs an arithmetic shift right and, therefore, recycles the most significant bit for each bit shift right. For example, given the fixed-point number 11001.011 (-6.625), a bit shift two places to the right with the binary point unmoved yields the number 11110.010 (-1.75), as shown in the model below:

To perform a logical shift right on a signed number using the Shift Arithmetic block, use the Data Type Conversion block to cast the number as an unsigned number of equivalent length and scaling. This model shows that the fixed-point signed number 11001.011 (-6.625) becomes 00110.010 (6.25).
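
Both kinds of right shift can be reproduced for the bit pattern 11001.011 with a short Python sketch. This illustrates the arithmetic only, not the Shift Arithmetic block itself; `to_signed` is a hypothetical helper, and the three fractional bits correspond to dividing the stored integer by 8.

```python
def to_signed(raw, bits=8):
    """Interpret a `bits`-wide bit pattern as a two's complement integer."""
    return raw - (1 << bits) if raw & (1 << (bits - 1)) else raw


raw = 0b11001011               # 11001.011 with the binary point removed
stored = to_signed(raw)        # signed stored integer

arithmetic = stored >> 2       # Python's >> on a negative int replicates the sign bit
logical = raw >> 2             # on the unsigned pattern, zeros enter from the left

print(stored / 8)              # original value
print(format(arithmetic & 0xFF, "08b"), arithmetic / 8)
print(format(logical, "08b"), logical / 8)
```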