Douglas Walls' Weblog

Monday March 13, 2006

Decimal Floating Point Types
In a couple of weeks I'm off to a meeting of ISO/SC22/WG14, the C programming language committee.  One of the papers on the agenda (N1154) is a proposal for a Technical Report on adding decimal floating point types and arithmetic to the C programming language specification.  The proposal is based on a model of decimal arithmetic that formalizes the decimal system of numeration (Algorism), as further defined and constrained by IEEE-854, ANSI X3-274, and the proposed revision of IEEE-754 (known as IEEE-754R).

The proposal adds decimal floating point within the type hierarchy, as base types, real types and arithmetic types.  The three types are called _Decimal32, _Decimal64, and _Decimal128.

An implementation must define a new macro to indicate conformance to this technical report.

The proposal introduces the term generic floating types for the existing floating point types: float, double, and long double.  Together the generic floating types and the decimal floating types are known as the real floating types.

The three decimal encoding formats defined in IEEE-754R (decimal32, decimal64, and decimal128) correspond to the three decimal floating types _Decimal32, _Decimal64, and _Decimal128 respectively.  The details of the formats are given in IEEE-754R.

New macros similar to those of <float.h> are defined in a new header <decfloat.h>, for example DEC_EVAL_METHOD, DEC32_MANT_DIG, DEC64_MANT_DIG, and DEC128_MANT_DIG.  The prefixes DEC32_, DEC64_, and DEC128_ denote the types _Decimal32, _Decimal64, and _Decimal128 respectively.

Conversion from a decimal floating type to an integer type is as you would expect: the fractional part is discarded (the value is truncated towards zero).  If the value cannot be represented by the integer type, the result depends on the signedness of the integer type.  If the type is unsigned, the result is the largest representable number when the floating point value is positive, and 0 otherwise.  If the type is signed, the result is the most negative or most positive representable number, according to the sign of the floating point number.
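
The truncation and saturation rules described above can be sketched in plain C.  This is only an illustration, not code from the proposal: dec_model, dec_to_int32 and dec_to_uint32 are hypothetical names, and the struct merely stands in for a decimal floating value as coefficient * 10^exponent so that the sketch compiles without decimal floating point support.

```c
#include <stdint.h>

/* Hypothetical stand-in for a decimal floating value: coefficient * 10^exponent. */
typedef struct {
    int64_t coefficient;
    int exponent;
} dec_model;

/* Signed target: discard the fraction, saturate at INT32_MIN / INT32_MAX. */
static int32_t dec_to_int32(dec_model d)
{
    int64_t v = d.coefficient;
    int e = d.exponent;

    /* Negative exponent: drop digits. C integer division truncates toward zero. */
    while (e < 0 && v != 0) { v /= 10; e++; }

    /* Positive exponent: scale up, saturating before the value leaves the range. */
    while (e > 0 && v != 0) {
        if (v > INT32_MAX / 10) return INT32_MAX;
        if (v < INT32_MIN / 10) return INT32_MIN;
        v *= 10; e--;
    }
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* Unsigned target: too-large positive values give UINT32_MAX, negative values give 0. */
static uint32_t dec_to_uint32(dec_model d)
{
    int64_t v = d.coefficient;
    int e = d.exponent;

    while (e < 0 && v != 0) { v /= 10; e++; }
    if (v <= 0) return 0;   /* zero after truncation, or negative and unrepresentable */

    while (e > 0) {
        if (v > (int64_t)(UINT32_MAX / 10)) return UINT32_MAX;
        v *= 10; e--;
    }
    if (v > (int64_t)UINT32_MAX) return UINT32_MAX;
    return (uint32_t)v;
}
```

With this model, 1.999 (coefficient 1999, exponent -3) converts to 1 and -1.999 converts to -1, while 5 * 10^12 saturates to INT32_MAX or UINT32_MAX depending on the target type.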

For conversion from integer to decimal floating type, if the value being converted can be represented exactly, it is unchanged.  If the value being converted is in the range of values that can be represented but cannot be represented exactly, the result is correctly rounded.  If the value being converted is outside the range of values that can be represented, the result is positive or negative infinity depending on the sign of the value being converted, and the “overflow” floating-point exception will be raised.
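
As a rough illustration of "correctly rounded", the sketch below converts an integer to a value with a limited number of decimal digits.  The names dec_model and int_to_dec are hypothetical, the struct is only a stand-in for a decimal floating value, and round-to-nearest with ties to even is an assumed rounding direction.  The overflow case is not modelled, since a 64-bit integer cannot exceed the range of the decimal formats.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a decimal floating value: coefficient * 10^exponent. */
typedef struct {
    int64_t coefficient;
    int exponent;
} dec_model;

/* Convert n to a value with at most `digits` significant decimal digits,
   rounding to nearest with ties to even (an assumed rounding direction). */
static dec_model int_to_dec(int64_t n, int digits)
{
    assert(digits >= 1 && digits <= 18);

    uint64_t mag = n < 0 ? 0 - (uint64_t)n : (uint64_t)n;  /* |n| without overflow */
    uint64_t limit = 1;
    for (int i = 0; i < digits; i++)
        limit *= 10;                      /* limit = 10^digits */

    int exponent = 0;
    unsigned last = 0;                    /* most recently dropped digit */
    int sticky = 0;                       /* set if an earlier dropped digit was nonzero */
    while (mag >= limit) {
        sticky |= (last != 0);
        last = (unsigned)(mag % 10);
        mag /= 10;
        exponent++;
    }

    /* Round up when above the halfway point, or exactly halfway and odd. */
    if (last > 5 || (last == 5 && (sticky || (mag & 1)))) {
        mag++;
        if (mag == limit) {               /* rounding carried into a new digit */
            mag /= 10;
            exponent++;
        }
    }

    dec_model r;
    r.coefficient = n < 0 ? -(int64_t)mag : (int64_t)mag;
    r.exponent = exponent;
    return r;
}
```

Calling int_to_dec(12345678, 7), with 7 digits being the precision of the decimal32 format, yields coefficient 1234568 and exponent 1, while int_to_dec(1234567, 7) is returned unchanged because it is exactly representable.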

For conversion between generic floating types and decimal floating types, the rules in the TR are similar to the existing ones for float, double and long double, except that when the result cannot be represented exactly, the behavior is tightened to become correctly rounded.

The TR does not add complex or imaginary decimal floating types.  However, it does add the equivalent rules for conversion between complex and imaginary types to decimal floating types as exist for conversion between generic floating types.

Determining the common type for mixed operations between decimal and other real types is difficult because the ranges overlap.  Therefore mixed mode operations are not allowed, and the use of explicit casts is required.  Implicit conversions are allowed only for simple assignment and in argument passing.

There is no default argument promotion specified for the decimal floating types.

The new suffixes to denote decimal floating constants are: DF for _Decimal32, DD for _Decimal64, and DL for _Decimal128.

It would help usability if unsuffixed floating constants could be used to initialize decimal floating types.  However, an unsuffixed constant such as 0.1 has type double, and in implementations where FLT_EVAL_METHOD is not -1, the internal representation of 0.1 is not exact.  This defeats the purpose of decimal floating types.  So the proposal introduces a translation time data type (TTDT), which the translator uses as the type for unsuffixed floating constants.  An unsuffixed floating constant is kept as a TTDT until an operation requires it to be converted to an actual type.  The value of the constant remains exact for as long as possible during the translation process.
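
To see why an inexact 0.1 matters, here is a small sketch in standard C.  It does not use the proposed types: the decimal side is modelled with an integer coefficient at a fixed exponent of -1, and the function names are hypothetical.

```c
#include <stdint.h>

/* Add 0.1 ten times in binary double and compare the sum against 1.0. */
static int binary_tenths_reach_one(void)
{
    volatile double step = 0.1;   /* volatile keeps the additions at run time */
    double sum = 0.0;
    for (int i = 0; i < 10; i++)
        sum += step;
    return sum == 1.0;            /* false: 0.1 has no exact binary representation */
}

/* The same sum with a decimal model: coefficients at a shared exponent of -1. */
static int decimal_tenths_reach_one(void)
{
    int64_t step = 1;             /* 1 * 10^-1 */
    int64_t sum = 0;              /* coefficient of the running total */
    for (int i = 0; i < 10; i++)
        sum += step;
    return sum == 10;             /* 10 * 10^-1 is exactly 1.0 */
}
```

The binary version returns 0 and the decimal model returns 1, which is the kind of difference that keeping constants exact is meant to preserve.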

The concept can be summarized as follows:
Examples:

double f;
f = 0.1;

Suppose the implementation uses _Decimal128 as the TTDT. 0.1 is represented exactly after the constant is scanned. It is then converted to double in the assignment operator.

f = 0.1 * 0.3;

Here, both 0.1 and 0.3 are represented in TTDT.  If the compiler evaluates the expression during translation time, it would be done using TTDT, and the result would be TTDT.  This is then converted to double before the assignment.  If the compiler generates code to evaluate the expression during execution time, both 0.1 and 0.3 would be converted to double before the multiply.  The result of the former would differ from that of the latter, and would be more precise.

float g = 0.3f;
f = 0.1 * g;

When one operand is a TTDT and the other is one of float/double/long double, the TTDT is converted to double with an internal representation following the specification of FLT_EVAL_METHOD for constants of type double.  The usual arithmetic conversions are then applied to the resulting operands.

_Decimal32 h = 0.1;

If one operand is a TTDT and the other a decimal floating type, the TTDT is converted to _Decimal64 with an internal representation specified by DEC_EVAL_METHOD. Usual arithmetic conversion is then applied.

If one operand is a TTDT and the other a decimal floating type, the TTDT is converted to the decimal floating type.

The floating-point environment <fenv.h> specified in C99 also applies to the decimal floating types.  The decimal floating-point arithmetic specified is more stringent.  All the rounding directions and flags are supported.

Certain algorithms stipulate a precision on the result of an operation; and this precision could be different from those of the three standard types.  The technical report adds a pragma directive to control this during translation time.

#pragma STDC DEC_MAX_PRECISION integer | DEFAULT

A host of new functions are added to <math.h> to support the new decimal floating types, and the new macros HUGE_VAL_D32, HUGE_VAL_D64, HUGE_VAL_D128, DEC_INFINITY and DEC_NAN are defined to help in using these functions.  The functions are equivalent to the existing generic floating type functions, with d32, d64, and d128 suffixes added for the decimal floating type versions.  Similarly, equivalent functions to support decimal floating types are added to <stdlib.h> and <wchar.h>, and macros to <tgmath.h>.

Lastly, new quantize functions are added to <math.h>.  These functions set the exponent of argument x to the exponent of argument y, while attempting to keep the value the same.
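
The effect of quantize can be sketched with a simple coefficient and exponent model.  This is not the proposed <math.h> interface: dec_model and quantize_model are hypothetical names, rounding to nearest with ties to even is assumed, and precision limits, infinities and NaNs are ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a decimal floating value: coefficient * 10^exponent. */
typedef struct {
    int64_t coefficient;
    int exponent;
} dec_model;

/* Return x re-expressed with y's exponent, keeping the value where possible. */
static dec_model quantize_model(dec_model x, dec_model y)
{
    dec_model r = x;

    /* Target exponent is smaller: append zeros, the value is unchanged. */
    while (r.exponent > y.exponent) {
        assert(r.coefficient <= INT64_MAX / 10 && r.coefficient >= INT64_MIN / 10);
        r.coefficient *= 10;
        r.exponent--;
    }
    if (r.exponent == y.exponent)
        return r;

    /* Target exponent is larger: drop digits, rounding ties to even. */
    int negative = r.coefficient < 0;
    uint64_t mag = negative ? 0 - (uint64_t)r.coefficient : (uint64_t)r.coefficient;
    unsigned last = 0;            /* most recently dropped digit */
    int sticky = 0;               /* set if an earlier dropped digit was nonzero */
    while (r.exponent < y.exponent) {
        sticky |= (last != 0);
        last = (unsigned)(mag % 10);
        mag /= 10;
        r.exponent++;
    }
    if (last > 5 || (last == 5 && (sticky || (mag & 1))))
        mag++;

    r.coefficient = negative ? -(int64_t)mag : (int64_t)mag;
    return r;
}
```

For example, quantizing 2.17 (coefficient 217, exponent -2) against a value with exponent -1 gives 22 * 10^-1, that is 2.2, while quantizing it against exponent -4 gives 21700 * 10^-4 with the value unchanged.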

For a look at the full document and the rationale see:

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1154.pdf

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1161.pdf



( Mar 13 2006, 11:50:45 AM PST )
