Commit 58768ab

Merge pull request #1906 from feature/bigint
256-bit big integer support

This PR introduces support for big integers in AtomVM, allowing arithmetic and bitwise operations on integers up to 256-bit (sign + 256-bit magnitude). This significantly extends AtomVM's numeric capabilities beyond the previous 64-bit limitation.

Core Big Integer Support:
- Implemented a new big integer representation using boxed terms with sign bit
- Added comprehensive big integer arithmetic through the new `intn` module
- `term_is_any_integer()` returns true for big integers
- Boxed integers now utilize the sign bit for efficient sign representation

Arithmetic Operations:
- All arithmetic operations (`+`, `-`, `*`, `div`, `rem`, `abs`, `neg`) now support integers up to 256-bit
- All bitwise operations (`band`, `bor`, `bxor`, `bnot`, `bsl`, `bsr`) now support integers up to 256-bit
- Float conversion functions now handle big integer conversions in both directions

Serialization Support:
- Added big integer support in `binary_to_term/1` and `term_to_binary/1,2`
- External term format now encodes/decodes big integers as `SMALL_BIG_EXT`

JIT Enhancements:
- Added JIT support for big integer encoding
- Implemented big integer constant support in opcodes (JIT and Emu)

Overflow Checking (breaking):
- `bsl` (bitshift left) now properly checks for overflow. While this shouldn't affect existing code (integers were previously limited to 64 bits), ensure values are masked before left bitshifts: e.g., `(16#FFFF band 16#F) bsl 252`

Error Handling (breaking):
- `binary_to_integer/1` no longer accepts binaries with whitespace or prefixes like `<<"0xFF">>` or `<<" 123">>`
- `binary_to_integer` and `list_to_integer` now raise `badarg` instead of `overflow` when parsing integers exceeding 256 bits. Update error handling code accordingly

Bug Fixes:
- Fixed `list_to_integer` bug with integers close to `INT64_MAX`

These changes are made under both the "Apache 2.0" and the "GNU Lesser General Public License 2.1 or later" license terms (dual license).
SPDX-License-Identifier: Apache-2.0 OR LGPL-2.1-or-later
2 parents 2579792 + 99e48e6 commit 58768ab

39 files changed

+9565
-722
lines changed

CHANGELOG.md

Lines changed: 10 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -59,6 +59,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
5959
- Reimplemented `lists:keyfind`, `lists:keymember` and `lists:member` as NIFs
6060
- Added `AVM_PRINT_PROCESS_CRASH_DUMPS` option
6161
- Added `lists:ukeysort/2`
62+
- Added support for big integers up to 256-bit (sign + 256-bit magnitude)
63+
- Added support for big integers in `binary_to_term/1` and `term_to_binary/1,2`
6264

6365
### Changed
6466

@@ -69,6 +71,13 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
6971
- Entry point now is `init:boot/1` if it exists. It starts the kernel application and calls `start/0` from the
7072
identified startup module. Users who started kernel application (typically for distribution) must no longer
7173
do it. Starting `net_kernel` is still required.
74+
- All arithmetic operations (`+`, `-`, `*`, `div`, `rem`, `abs`, etc.) now support integers up to 256-bit
75+
- All bitwise operations (`band`, `bor`, `bxor`, `bnot`, `bsl`, `bsr`) now support integers up to 256-bit
76+
- Float conversion functions now support converting to/from big integers
77+
- `bsl` now properly checks for overflow
78+
- `binary_to_integer/1` no longer accepts binaries such as `<<"0xFF">>` or `<<" 123">>`
79+
- `binary_to_integer` and `list_to_integer` no longer raise an `overflow` error; they raise
80+
`badarg` instead.
7281

7382
### Fixed
7483

@@ -79,6 +88,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
7988
- packbeam: fix memory leak preventing building with address sanitizer
8089
- Fixed a bug where empty atom could not be created on some platforms, thus breaking receiving a message for a registered process from an OTP node.
8190
- Fix a memory leak in distribution when a BEAM node would monitor a process by name.
91+
- Fixed a `list_to_integer` bug with integers close to `INT64_MAX`
8292

8393
## [0.6.7] - Unreleased
8494

UPDATING.md

Lines changed: 6 additions & 0 deletions
@@ -13,6 +13,12 @@ port socket driver, are also represented by a port and some matching code may ne
1313
`is_pid/1` to `is_port/1`.
1414
- Ports and pids can be registered. Function `globalcontext_get_registered_process` result now is
1515
a term that can be a `port()` or a `pid()`.
16+
- `bsl` (bitshift left) now checks for overflow. This shouldn't be a practical issue for existing
17+
code, since integers were previously limited to 64 bits, but make sure to mask values before left
18+
bitshifts, e.g. `(16#FFFF band 16#F) bsl 252`.
19+
- `binary_to_integer` and `list_to_integer` no longer raise an `overflow` error; they instead
20+
raise `badarg` when trying to parse an integer that exceeds 256 bits. Update any relevant error
21+
handling code.
1622

1723
## v0.6.4 -> v0.6.5
1824

doc/src/differences-with-beam.md

Lines changed: 104 additions & 5 deletions
@@ -39,11 +39,110 @@ AtomVM does not implement some key features of the BEAM. Some of these limitatio
3939
worked on and this list might be outdated. Do not hesitate to check GitHub issues or contact us
4040
when in doubt.
4141

42-
### Wide precision integers
43-
44-
AtomVM currently only supports 64 bits integers. This is being worked on. However, please note
45-
that AtomVM is unlikely to support arbitrary precision integers as libraries for such support
46-
usually are quite large.
42+
### Integer precision and overflow
43+
44+
AtomVM supports integers up to 256-bit with an additional sign flag, while BEAM supports unlimited
45+
precision integers. This fundamental difference has several implications:
46+
47+
#### Integer limits
48+
49+
- **Maximum value**: `16#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF` (256
50+
ones, which equals `2^256 - 1`)
51+
- **Minimum value**: `-16#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF` (which
52+
equals `-(2^256 - 1)`)
53+
54+
Note that AtomVM does not use two's complement for big integers. The sign is stored as a separate
55+
flag, which means `INTEGER_MAX = -INTEGER_MIN`.
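
As a rough illustration of the symmetric range described above, the following sketch checks whether an integer falls within the documented limits. The module and function names are hypothetical and not part of AtomVM. It is only meaningful on a BEAM node (for instance before handing a value to an AtomVM node), since an out-of-range integer cannot exist on AtomVM in the first place.

```erlang
-module(avm_int_range).
-export([int_max/0, int_min/0, in_range/1]).

%% Hypothetical helpers, intended to run on BEAM only.
%% (1 bsl 256) - 1 equals the 64-digit hexadecimal maximum listed above.
int_max() -> (1 bsl 256) - 1.

%% The sign is a separate flag, so the minimum is simply the negated maximum.
int_min() -> -int_max().

%% true when I lies within the documented AtomVM integer range.
in_range(I) when is_integer(I) ->
    I >= int_min() andalso I =< int_max().
```

Because the range is symmetric, `in_range(int_min())` and `in_range(int_max())` both hold, while `in_range(int_max() + 1)` does not.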
56+
57+
#### Overflow errors
58+
59+
Unlike BEAM, AtomVM raises `overflow` errors when integer operations exceed 256-bit capacity:
60+
61+
```erlang
62+
IntMax = 16#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,
63+
% The following will raise an overflow error on AtomVM, but succeeds on BEAM:
64+
Result = IntMax + 1 % overflow error
65+
66+
% Also applies to subtraction and multiplication:
67+
-IntMax - 1 % overflow error
68+
IntMax * 2 % overflow error
69+
```
70+
71+
Handling overflows:
72+
73+
```erlang
74+
safe_calc(MaybeOvfFun) ->
75+
try MaybeOvfFun() of
76+
I when is_integer(I) -> {ok, I}
77+
catch
78+
error:overflow -> {error, overflow}
79+
end.
80+
81+
% Returns `{ok, Result}`, Result is a 255 bit integer
82+
safe_calc(fun() -> factorial(57) end).
83+
84+
% Returns `{error, overflow}`, since 261 bit integers are not allowed
85+
safe_calc(fun() -> factorial(58) end).
86+
```
87+
88+
Overflow can also occur with:
89+
- Bit shift left operations: `1 bsl 257` raises overflow (shifting beyond the 256-bit boundary).
90+
When shifting values with multiple set bits, mask first to prevent overflow: `16#FFFF bsl 252`
91+
would overflow, but `(16#FFFF band 16#F) bsl 252` succeeds
92+
- Float to integer conversions: `ceil/1`, `round/1`, etc. when the result exceeds 256-bit
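
One way to reason about the shift case is to compare the bit length of the value plus the shift amount against 256. The sketch below is an assumption-laden illustration, not AtomVM's actual check: the module and function names are made up, and the rule is derived only from the documented maximum of `2^256 - 1`.

```erlang
-module(avm_bsl_check).
-export([bit_length/1, bsl_fits/2]).

%% Number of bits needed for the magnitude of N (0 for N =:= 0).
bit_length(N) when is_integer(N) -> bit_length(abs(N), 0).

bit_length(0, Acc) -> Acc;
bit_length(M, Acc) -> bit_length(M bsr 1, Acc + 1).

%% true when `Value bsl Shift` should still fit in a 256-bit magnitude.
%% Hypothetical pre-check: magnitude bits plus shift must not exceed 256.
bsl_fits(Value, Shift) when is_integer(Value), is_integer(Shift), Shift >= 0 ->
    bit_length(Value) + Shift =< 256.
```

With this rule, `bsl_fits(16#FFFF, 252)` is `false` (16 + 252 bits), while `bsl_fits(16#FFFF band 16#F, 252)` is `true` (4 + 252 bits), matching the masking advice in the list above.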
93+
94+
Note: While BEAM raises `system_limit` error for operations like
95+
`1 bsl 2000000000000000000000000000000000`, AtomVM consistently uses `overflow` error for all
96+
integer capacity violations.
97+
98+
Note: Integer literals larger than 256 bits in source code will compile successfully with
99+
Erlang/Elixir compilers, but the resulting BEAM files will fail to load on AtomVM. This also
100+
applies to compile-time constant expressions that evaluate to integers exceeding 256 bits, such as
101+
`1 bsl 300`. These expressions are evaluated by the compiler and stored as constants in the BEAM
102+
file, causing the same load-time failure. Always ensure that integer constants in your code are
103+
within AtomVM's supported range.
104+
105+
Note: The `erlang:binary_to_term/1,2` function raises a `badarg` error when attempting to
106+
deserialize binary data containing an integer larger than 256 bits. This differs from BEAM, which
107+
can deserialize integers of any size. Applications that exchange serialized terms with BEAM nodes
108+
should be aware of this limitation.
109+
110+
Note: String and binary conversion functions such as `erlang:binary_to_integer/1,2`,
111+
`erlang:list_to_integer/1,2`, and Elixir's `String.to_integer/1,2` raise a `badarg` error when the
112+
input represents an integer exceeding 256 bits. For example,
113+
`erlang:binary_to_integer(<<"10000000000000000000000000000000000000000000000000000000000000000">>, 16)`
114+
will fail with `badarg` on AtomVM, while it succeeds on BEAM. Applications parsing user input or
115+
external data should validate that numeric values fall within AtomVM's supported range.
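
A minimal sketch of defensive parsing, assuming the only failure to handle is `badarg`. The module and function names are hypothetical. Malformed input and (on AtomVM) values beyond 256 bits both surface as `badarg`, so the caller cannot tell the two apart from the error alone.

```erlang
-module(avm_parse).
-export([parse_integer/1]).

%% Wraps binary_to_integer/1 so that callers get a tagged tuple
%% instead of an exception when the input cannot be parsed.
parse_integer(Bin) when is_binary(Bin) ->
    try binary_to_integer(Bin) of
        I -> {ok, I}
    catch
        error:badarg -> {error, badarg}
    end.
```

For example, `parse_integer(<<"42">>)` yields `{ok, 42}`, while `parse_integer(<<"0xFF">>)` and `parse_integer(<<" 123">>)` yield `{error, badarg}`.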
116+
117+
#### Bitwise operations edge cases
118+
119+
The 256-bit limitation creates specific edge cases with bitwise operations that would require 257
120+
bits:
121+
122+
On BEAM (unlimited precision), returns `-IntMax - 1` (requires 257 bits):
123+
124+
```erlang
125+
1> IntMax = 16#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF.
126+
115792089237316195423570985008687907853269984665640564039457584007913129639935
127+
2> integer_to_binary(-1 bxor IntMax, 16).
128+
<<"-10000000000000000000000000000000000000000000000000000000000000000">>
129+
3> integer_to_binary(bnot IntMax, 16).
130+
<<"-10000000000000000000000000000000000000000000000000000000000000000">>
131+
```
132+
133+
On AtomVM (256-bit limited), returns 0 (cannot represent 257th bit):
134+
135+
```erlang
136+
1> IntMax = 16#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF.
137+
115792089237316195423570985008687907853269984665640564039457584007913129639935
138+
2> -1 bxor IntMax.
139+
0
140+
3> bnot IntMax.
141+
0
142+
```
143+
144+
This occurs because the exact result, `-(2^256)`, has a magnitude that needs a 257th bit, which AtomVM
145+
cannot represent. The remaining 256 magnitude bits are all zero, and since `-0` is not allowed, the result is normalized to `0`.
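
The edge case follows from the identity `bnot X =:= -X - 1`. The sketch below, meant to run on BEAM and using hypothetical names, computes the exact result and checks whether its magnitude still fits in 256 bits. Under the documented limits, the only non-negative input that fails is `IntMax` itself.

```erlang
-module(avm_bnot_check).
-export([bnot_fits/1]).

%% Documented AtomVM maximum, written as an expression (BEAM only).
int_max() -> (1 bsl 256) - 1.

%% true when the exact value of `bnot X` has a magnitude of at most 256 bits.
bnot_fits(X) when is_integer(X) ->
    Exact = -X - 1,   % same value as `bnot X` on BEAM
    abs(Exact) =< int_max().
```

Code that must behave identically on both VMs could guard with a check of this kind before applying `bnot` or `-1 bxor X` to values near the limit.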
47146

48147
### Bit syntax
49148

doc/src/memory-management.md

Lines changed: 104 additions & 10 deletions
@@ -180,7 +180,15 @@ loaded) a fixed size table. Management of the global atom table is outside of t
180180

181181
### Integers
182182

183-
An integer is represented as a single word, with the low-order 4 bits having the value `0xF` (`1111b`). The high order word-size-6 bits are used to represent the integer value:
183+
AtomVM supports integers up to 256 bits with an additional sign bit stored outside the numeric
184+
payload. The representation strategy depends on the integer's size and uses canonicalization to
185+
ensure each value has exactly one representation.
186+
187+
#### Immediate Integers
188+
189+
Small integers are represented as a single word, with the low-order 4 bits having the value `0xF`
190+
(`1111b`). The high order word-size-4 bits are used to represent the integer value using two's
191+
complement:
184192

185193
|< 4>|
186194
+===========================+====+
@@ -189,11 +197,13 @@ An integer is represented as a single word, with the low-order 4 bits having the
189197
| |
190198
|<---------- word-size --------->|
191199

192-
The magnitude of an integer is therefore limited to `2^{word-size - 4}` in an AtomVM program (e.g., on a 32-bit platform, `+- 134,217,728`).
200+
On 32-bit systems, immediate integers can represent signed values in the range `[-2^27, 2^27-1]` (28
201+
bits + 4-bit tag = 32 bits).
202+
On 64-bit systems, immediate integers can represent signed values in the range `[-2^59, 2^59-1]` (60
203+
bits + 4-bit tag = 64 bits).
193204

194-
```{attention}
195-
Arbitrarily large integers (bignums) are not currently supported in AtomVM.
196-
```
205+
For integers outside these ranges, AtomVM uses boxed representations (see Boxed Integers section
206+
below).
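
The immediate ranges above can be derived from the word size alone: the 4-bit tag leaves `word-size - 4` bits for a two's complement value. The following sketch models that arithmetic in Erlang purely for illustration; the names are hypothetical and the real logic lives in the C runtime.

```erlang
-module(avm_immediate).
-export([immediate_range/1, is_immediate/2]).

%% Returns {Min, Max} for immediate integers on a 32- or 64-bit word.
immediate_range(WordBits) when WordBits =:= 32; WordBits =:= 64 ->
    ValueBits = WordBits - 4,          % bits left after the 4-bit tag
    Half = 1 bsl (ValueBits - 1),      % two's complement half range
    {-Half, Half - 1}.

%% true when I can be stored without boxing on the given word size.
is_immediate(I, WordBits) when is_integer(I) ->
    {Min, Max} = immediate_range(WordBits),
    I >= Min andalso I =< Max.
```

For a 32-bit word this gives `{-134217728, 134217727}`, i.e. `[-2^27, 2^27-1]`.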
197207

198208
### nil
199209

@@ -242,6 +252,88 @@ A boxed term pointer is a single-word term that contains the address of the refe
242252

243253
Because terms (and hence the heap) are always aligned on boundaries that are divisible by the word size, the low-order 2 bits of a term address are always 0. Consequently, the high-order word-size - 2 bits (`1,073,741,824` distinct values, on a 32-bit platform) are sufficient to represent any term address in the AtomVM address space, for 32-bit and greater machine architectures.
244254

255+
### Boxed Integers
256+
257+
AtomVM uses boxed integers for values that exceed the immediate integer range. There are two types
258+
of boxed integer representations: native integers (using int32_t or int64_t) and big integers (using
259+
arrays of uint32_t digits).
260+
261+
#### Native Boxed Integers
262+
263+
For integers that don't fit in immediate representation but can be stored in native C integer
264+
types, AtomVM uses boxed integers with two's complement encoding and a redundant sign bit in the
265+
header.
266+
267+
**On 32-bit systems:**
268+
- Integers in range `[-2^31, -2^27-1] ∪ [2^27, 2^31-1]` are stored as boxed int32_t (single word
269+
payload)
270+
- Integers in range `[-2^63, -2^31-1] ∪ [2^31, 2^63-1]` are stored as boxed int64_t (two word
271+
payload)
272+
273+
**On 64-bit systems:**
274+
- Integers in range `[-2^63, -2^59-1] ∪ [2^59, 2^63-1]` are stored as boxed int64_t (single word
275+
payload)
276+
277+
The boxed header uses:
278+
- `0x8` (`001000b`) for positive integers (TERM_BOXED_POSITIVE_INTEGER)
279+
- `0xC` (`001100b`) for negative integers (TERM_BOXED_NEGATIVE_INTEGER)
280+
281+
|< 6 >|
282+
+=========================+======+
283+
| boxed-size (1 or 2) |001X00| boxed[0] (X=0 for positive, X=1 for negative)
284+
+-------------------------+------+
285+
| native integer value | boxed[1] (int32_t or int64_t low word)
286+
+--------------------------------+
287+
| high word (if int64_t on | boxed[2] (32-bit systems only)
288+
| 32-bit system) |
289+
+================================+
290+
| |
291+
|<---------- word-size --------->|
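
To make the header layout concrete, the sketch below models a header word as a plain non-negative Erlang integer and splits it into the 6-bit tag and the boxed size. This is an illustration only: the module name, function name, and returned atoms are assumptions, and the VM itself does this decoding in C.

```erlang
-module(avm_boxed_header).
-export([decode/1]).

-define(TAG_MASK, 16#3F).           % low-order 6 bits hold the tag
-define(POSITIVE_INTEGER, 16#8).    % 001000b
-define(NEGATIVE_INTEGER, 16#C).    % 001100b

%% Splits a modelled header word into {Kind, BoxedSize}.
decode(Header) when is_integer(Header), Header >= 0 ->
    Size = Header bsr 6,
    case Header band ?TAG_MASK of
        ?POSITIVE_INTEGER -> {positive_integer, Size};
        ?NEGATIVE_INTEGER -> {negative_integer, Size};
        Other -> {other, Other, Size}
    end.
```

The two integer tags differ only in bit `0x4`, which is the sign bit mentioned above.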
292+
293+
#### Big Integers
294+
295+
For integers beyond the native int64_t range (up to ±(2^256 - 1)), AtomVM uses an array of uint32_t
296+
digits representing the magnitude, with the sign stored as a flag in the boxed header. These big
297+
integers do NOT use two's complement encoding.
298+
299+
The digits array:
300+
- Stores the absolute value of the integer
301+
- Uses little-endian ordering (digit[0] is least significant)
302+
- Omits leading zero digits to save space
303+
- Includes a dummy zero digit when necessary to avoid ambiguity with native boxed integers
304+
305+
|< 6 >|
306+
+=========================+======+
307+
| boxed-size (n) |001X00| boxed[0] (X=0 for positive, X=1 for negative)
308+
+-------------------------+------+
309+
| digit[0] (lsb) | boxed[1] (uint32_t)
310+
+--------------------------------+
311+
| digit[1] | boxed[2] (uint32_t)
312+
+--------------------------------+
313+
| ... | ...
314+
+--------------------------------+
315+
| digit[k-1] (msb) | boxed[k] (uint32_t)
316+
+--------------------------------+
317+
| 0 (dummy digit if needed) | boxed[n] (uint32_t)
318+
+================================+
319+
| |
320+
|<---------- word-size --------->|
321+
322+
**Canonicalization Rules:**
323+
- AtomVM ensures that integers are always stored in the most compact representation
324+
- Operations that produce results fitting in a smaller representation automatically convert to that
325+
representation
326+
- A dummy digit mechanism ensures that the smallest big integer always has more words than the
327+
largest native boxed integer. This is required when storing values such as `UINT64_MAX`
328+
(`0xFFFFFFFFFFFFFFFF`), which would need only 2 digits: the boxed-size field must still make it
329+
possible to distinguish them from native boxed integers (such as `int64_t`)
330+
331+
**Examples:**
332+
- The value 3 is always stored as an immediate integer (never as a boxed integer)
333+
- On a 64-bit system, 2^60 would be stored as a boxed int64_t, not as a big integer
334+
- The value 2^100 would be stored as a big integer with 4 uint32_t digits (plus potentially a dummy
335+
digit)
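
The selection rules above can be summarized as a small decision function. The sketch below is a model written for illustration, with hypothetical names; it follows the documented ranges and does not account for the dummy digit.

```erlang
-module(avm_int_repr).
-export([classify/2, digit_count/1]).

%% true when I fits in a two's complement field of Bits bits.
fits_signed(I, Bits) ->
    Half = 1 bsl (Bits - 1),
    I >= -Half andalso I < Half.

%% Picks the most compact representation for I on a 32- or 64-bit word.
classify(I, WordBits) when is_integer(I), (WordBits =:= 32 orelse WordBits =:= 64) ->
    case fits_signed(I, WordBits - 4) of
        true -> immediate;
        false ->
            case {WordBits, fits_signed(I, 32), fits_signed(I, 64)} of
                {32, true, _} -> boxed_int32;
                {_, _, true} -> boxed_int64;
                _ -> big_integer
            end
    end.

%% Number of uint32_t digits needed for the magnitude of I.
digit_count(I) when is_integer(I) -> digit_count(abs(I), 0).

digit_count(0, N) -> N;
digit_count(M, N) -> digit_count(M bsr 32, N + 1).
```

For instance, `classify(3, 64)` is `immediate`, `classify(1 bsl 60, 64)` is `boxed_int64`, `classify(1 bsl 100, 64)` is `big_integer`, and `digit_count(1 bsl 100)` is `4`.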
336+
245337
### References
246338

247339
A reference (e.g., created via [`erlang:make_ref/0`](./apidocs/erlang/estdlib/erlang.md#make_ref0)) stores a 64-bit incrementing counter value (a "ref tick"). On 64 bit machines, a Reference takes up two words -- the boxed header and the 64-bit value, which of course can fit in a single word. On 32-bit platforms, the high-order 28 bits are stored in `boxed[1]`, and the low-order 32 bits are stored in `boxed[2]`:
@@ -278,7 +370,7 @@ Tuples are represented as boxed terms containing a boxed header (`boxed[0]`), a
278370

279371
### Maps
280372

281-
Maps are represented as boxed terms containing a boxed header (`boxed[0]`), a type tag of `0x3C` (`111100b`), followed by:
373+
Maps are represented as boxed terms containing a boxed header (`boxed[0]`), a type tag of `0x2C` (`101100b`), followed by:
282374

283375
* a term pointer to a tuple of arity `n` containing the keys in the map;
284376
* a sequence of `n`-many words, containing the values of the map corresponding (in order) to the keys in the reference tuple.
@@ -300,7 +392,7 @@ The keys and values are single word terms, i.e., either immediates or pointers t
300392
| ...
301393
| | |< 6 >|
302394
| +=========================+======+
303-
| | boxed-size (n) |111100| boxed[0]
395+
| | boxed-size (n) |101100| boxed[0]
304396
| +-------------------------+------+
305397
+-----------------< keys | boxed[1]
306398
+--------------------------------+
@@ -446,7 +538,7 @@ to `nil`.
446538
some
447539
binary |< 6 >|
448540
^ +=========================+======+
449-
| | boxed-size (5) |100100| boxed[0]
541+
| | boxed-size (5) |000100| boxed[0]
450542
| +-------------------------+------+
451543
| | match-or-binary-ref | boxed[1]
452544
| +--------------------------------+
@@ -464,15 +556,15 @@ A reference to a reference-counted binary counts as a reference, in which case t
464556

465557
#### Sub-Binaries
466558

467-
Sub-binaries are represented as boxed terms containing a boxed header (`boxed[0]`), a type tag of `0x28` (`001000b`)
559+
Sub-binaries are represented as boxed terms containing a boxed header (`boxed[0]`), a type tag of `0x28` (`101000b`)
468560

469561
A sub-binary is a boxed term that points to a reference-counted binary, recording the offset into the binary and the length (in bytes) of the sub-binary. An invariant for this term is that the `offset + length` is always less than or equal to the length of the referenced binary.
470562

471563
some
472564
refc
473565
binary |< 6 >|
474566
^ +=========================+======+
475-
| | boxed-size (3) |001000| boxed[0]
567+
| | boxed-size (3) |101000| boxed[0]
476568
| +-------------------------+------+
477569
| | len | boxed[1]
478570
| +--------------------------------+
@@ -630,6 +722,8 @@ A given process heap and stack occupy a single region of malloc'd memory, and it
630722

631723
Terms stored in the stack, registers, and process dictionary are either single-word terms (like atoms or pids) or term references, i.e., single-word terms that point to boxed terms or list cells in the heap. These terms constitute the "roots" of the memory graph of all "reachable" terms in the process.
632724

725+
Boxed integers, including both native boxed integers and big integers, are simple blob structures that are copied as-is during garbage collection. They do not contain any pointers or addresses that need to be updated during the garbage collection process.
726+
633727
### When does garbage collection happen?
634728

635729
Garbage collection typically occurs as the result of a request for an allocation of a multi-word term in the heap (e.g., a tuple, list, or binary, among other types), and when there is currently insufficient space in the free space between the current heap and the current stack to accommodate the allocation.

doc/src/programmers-guide.md

Lines changed: 2 additions & 2 deletions
@@ -19,7 +19,7 @@ Currently, AtomVM implements a strict subset of the BEAM instruction set.
1919
A high level overview of the supported language features include:
2020

2121
* All the major Erlang types, including
22-
* integers (with size limits)
22+
* integers (256-bit magnitude plus a separate sign)
2323
* floats
2424
* tuples
2525
* [lists](./apidocs/erlang/estdlib/lists.md)
@@ -740,7 +740,7 @@ The following Erlang type specification enumerates this type:
740740
Erlang/OTP uses the Christian epoch to count time units from year 0 in the Gregorian calendar. Thus, for example, the value 0 in Gregorian seconds represents the date Jan 1, year 0, and midnight (UTC), or in Erlang terms, `{{0, 1, 1}, {0, 0, 0}}`.
741741

742742
```{attention}
743-
AtomVM is currently limited to representing integers in at most 64 bits, with one bit representing the sign bit.
743+
AtomVM is currently limited to representing time in at most 64 bits, with one bit representing the sign bit.
744744
However, even with this limitation, AtomVM is able to resolve microsecond values in the Gregorian calendar for over
745745
292,000 years, well past the likely lifetime of an AtomVM application (unless perhaps launched on a deep
746746
space probe).

libs/jit/include/jit.hrl

Lines changed: 4 additions & 0 deletions
@@ -20,6 +20,10 @@
2020

2121
-define(JIT_FORMAT_VERSION, 1).
2222

23+
% Before adding any new platform to the list below:
24+
% Is it 64-bit big endian? if so, `put_digits` function in jit.erl must be updated to support
25+
% big endian platforms.
26+
2327
-define(JIT_ARCH_X86_64, 1).
2428
-define(JIT_ARCH_AARCH64, 2).
2529
-define(JIT_ARCH_ARMV6M, 3).
