Introduction
The Higher-Order Virtual Machine (HVM) was designed to pioneer massively parallel computation through a functional runtime built on the Interaction Net computational model, which lends itself naturally to the parallelism demanded by modern computational tasks. The need for a more focused approach to exploiting modern hardware, specifically Graphics Processing Units (GPUs), led to the creation of HVM-Core: a low-level compile target optimized to bring HVM's capabilities to GPUs.
HVM-Core was developed with the goal of compiling high-level languages, such as Python and Haskell, to CUDA, NVIDIA's parallel computing platform and API. This is a significant step towards harnessing the computational power of GPUs: HVM-Core acts as the bridge that translates high-level language code into a form suitable for efficient GPU execution, with CUDA as the target. This opens up the possibility of running computational tasks written in high-level languages seamlessly on GPUs, tapping into levels of parallelism and computational speed beyond what CPUs offer.
Why Compile Higher-Order Languages for Execution on GPUs?
HVM initially garnered attention for its innovative use of the Interaction Net computational model, which played a pivotal role in achieving parallelism, especially for higher-order computations. Despite these early successes and the promise they showed, it became apparent that a vast reserve of computational power, especially that of Graphics Processing Units (GPUs), remained untapped.
The transition towards GPU computing stemmed from the recognition of the immense computational capacity that GPUs offer. Unlike traditional Central Processing Units (CPUs), GPUs hold a significant advantage in handling many computations simultaneously, thanks to their thousands of cores designed for parallel processing.
However, the challenge lay in devising a mechanism that could seamlessly translate high-level language code into a form suitable for GPU execution. This need for a bridge between high-level languages and GPU computing was the catalyst for HVM-Core.
HVM-Core is therefore a refined, low-level compile target, crafted to optimize and transition code from high-level languages like Python and Haskell into a form executable on GPUs. By targeting CUDA, NVIDIA's parallel computing platform and programming model, HVM-Core serves as a conduit for translating high-level language code into a GPU-executable format.
How HVM-Core Works
At its core, HVM-Core acts as a low-level compile target, transposing the syntax and semantics of high-level languages such as Python and Haskell into an intermediary representation that is amenable to further compilation.
The subsequent phase of the process compiles this intermediary representation down to CUDA. The CUDA platform is the conduit through which HVM-Core taps into the parallel computational capabilities of GPUs, turning abstract code into tangible, high-speed computations.
A cornerstone of HVM-Core's architecture is its use of Interaction Nets, a graphical, rewriting-based computational model. This model expresses computation as local rewriting rules, which makes it well suited to parallel execution: rewrites at different parts of the net do not interfere with one another. When paired with the massively parallel architecture of GPUs, Interaction Nets offer fertile ground for executing numerous computational tasks concurrently.
Through the lens of Interaction Nets, HVM-Core breaks high-level language code down into a form that is inherently suited to parallelization. This transformation is the linchpin that enables Python and Haskell code to run on vastly parallel hardware, unlocking computational speed-ups that were previously out of reach.
In a nutshell, HVM-Core is where high-level language constructs, Interaction Nets, and GPU-based parallel computation converge. Through a two-step compilation process, it transposes high-level code into CUDA, ensuring that the vast computational resources housed within GPUs are leveraged effectively.
Performance Leap with HVM-Core
HVM-Core significantly improves computational speed, going from roughly 30 million rewrites per second on traditional CPU architectures to about 6.8 billion rewrites per second on NVIDIA's RTX 4090 GPU. This isn't just a small step; it's a massive jump that highlights the potential of modern GPU architectures.
Current frameworks have made progress with CPU-based computations, but the GPU domain remains largely untapped, especially for high-level languages like Python and Haskell. HVM-Core addresses this gap, offering GPU computations for these languages with unmatched efficiency.
The performance jump seen in HVM-Core comes down to two main factors: thorough optimization and the fast computation GPUs provide. HVM-Core's architecture is specifically designed to work seamlessly with the parallel nature of GPUs, ensuring that all computational resources are used to their fullest extent. This, combined with the raw computing power of modern GPUs like the NVIDIA RTX 4090, results in the impressive performance metrics that HVM-Core is known for.
Additionally, HVM-Core's computational model fully leverages the parallel capabilities of GPUs. By shifting from a CPU-centric approach to one that's optimized for GPUs, HVM-Core didn't just adapt—it maximized its performance to levels previously out of reach.
Getting Started with HVM-Core
Before diving into the applications and advantages of HVM-Core, it's essential to understand how to install and run it on your machine. Follow the steps below to install HVM-Core and execute a program:
Execute the following command to install HVM-Core:
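Assuming you have a Rust toolchain available, the crate can be installed from crates.io (hvm-core is the crate name used by the repository at the time of writing; if it has changed, install from a clone of the GitHub repository with cargo install --path . instead):

cargo install hvm-core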
Running HVM-Core:
To run on the GPU, the intended command is the following (see the caveat in the note below), replacing file.hvmc with the path to your HVM-Core program file:
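hvmc bend file.hvmc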
(Note: bend hasn't been implemented as of the writing of this article; if you're here today and want to run on the GPU, you'll need to manually edit hvm2.cu, generated via the hvmc gen-cuda-book file.hvmc command, and compile with -arch=compute_89.)
Now that HVM-Core is set up on your machine, you can start exploring its capabilities by running programs written in high-level languages like Python and Haskell, and experience the computational power of GPU execution.
HVM-Core provides a textual syntax that allows us to construct interaction combinator nets using a simple AST. Here is its grammar:
<TERM> ::=
  <ERA> ::= "*"
  <CON> ::= "(" <TERM> " " <TERM> ")"
  <DUP> ::= "[" <TERM> " " <TERM> "]"
  <CTR> ::= "{" <label> " " <TERM> " " <TERM> "}"
  <REF> ::= "@" <name>
  <U24> ::= "#" <value>
  <OP2> ::= "<" <TERM> " " <TERM> ">"
  <MAT> ::= "?" <TERM> <TERM>
  <VAR> ::= <name>
<NET> ::=
  <ROOT> ::= <TERM>
  <RDEX> ::= "&" <TERM> "~" <TERM> <NET>
<BOOK> ::=
  <DEF> ::= "@" <name> "=" <NET> <BOOK>
  <END> ::= <EOF>
Here’s a reference to what these terms mean:
ERA: an eraser node, as defined in the reference system.
CON: a constructor node, as defined in the reference system.
DUP: a duplicator node, as defined in the reference system.
CTR: an "extra" node, which behaves exactly like CON/DUP nodes, but with a different symbol. When the label is 0/1, it corresponds to a CON/DUP node.
VAR: a named variable, used to create a wire. Each name must occur twice, denoting both endpoints of a wire.
REF: a reference to a top-level definition, which is itself a closed net. That reference is unrolled lazily, allowing for recursive functions to be implemented without the need for Church numerals and the like.
U24: an unboxed 24-bit unsigned integer.
OP2: a binary operation on u24 operands.
MAT: a pattern-matching operator on u24 values.
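Putting the grammar together, here is a tiny book written for illustration (our own example, not taken from the official documentation): @id is a constructor acting as the identity, and @main applies it to the number #1 through an active pair, so the whole net reduces to #1.

@id = (x x)

@main = a
  & @id ~ (#1 a)

Note how a and x each occur exactly twice, marking the two endpoints of their wires, and how the & line declares a redex between the (lazily unrolled) @id reference and a constructor node.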
Evaluators
HVMC is equipped with two evaluators: a reference interpreter in Rust, and a massively parallel runtime in CUDA.
Both evaluators are completely eager, reducing every active pair (redex) in a highly aggressive manner. To facilitate recursion, it is recommended to pre-compile the source language into a supercombinator formulation. This allows sub-expressions to unfold lazily, preventing HVMC from infinitely expanding recursive function bodies.
The eager evaluator functions by maintaining a vector of current active pairs (redexes) and performing an ‘interaction’ for each redex.
In the single-core version, this for-each operation is executed sequentially. In the multi-core version, the vector of redexes is divided into a grid of 'redex bags', each owned by a 'rewrite squad' that repeatedly pops a redex from its bag and executes it, with all squads working in parallel.
Since interaction rules are symmetric on the 4 surrounding ports, four threads are utilized to conduct an interaction, hence the term ‘squad’.
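As a rough sketch, the single-core loop is a plain worklist over active pairs. The Rust fragment below is our own illustration, reusing the Net, Redex, and Ptr types shown in the Memory Layout section that follows; interact is a hypothetical stand-in for the actual rule dispatch:

// Minimal sketch of the single-core eager loop (illustrative only).
fn reduce(net: &mut Net) {
    // Keep popping active pairs until no redex remains.
    while let Some(Redex { a, b }) = net.rdex.pop() {
        // Performing an interaction may push newly formed redexes onto net.rdex.
        interact(net, a, b);
    }
}

// Hypothetical stand-in for the per-rule dispatch (annihilate, commute, ...).
fn interact(net: &mut Net, a: Ptr, b: Ptr) {
    // ... dispatch on the tags of `a` and `b` and rewrite the net locally ...
}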
Memory Layout
HVM-Core's memory layout is designed for optimal efficiency and mirrors the textual syntax. Nets are stored as a vector of trees, with the 'redex' buffer holding the tree roots as active pairs, and the 'nodes' buffer containing all the nodes. Every node comprises two 32-bit pointers, totaling 64 bits. The pointers consist of a 4-bit tag and a 28-bit value, enabling the addressing of a 2 GB space for each instance.
// A pointer is a 32-bit word
type Ptr = u32;

// A node stores its two aux ports
struct Node {
    p1: Ptr, // this node's fst aux port
    p2: Ptr, // this node's snd aux port
}

// A redex links two main ports
struct Redex {
    a: Ptr, // main port of node A
    b: Ptr, // main port of node B
}

// A closed net
struct Net {
    root: Ptr,        // a free wire
    rdex: Vec<Redex>, // a vector of redexes
    heap: Vec<Node>,  // a vector of nodes
}
Also, there are 16 pointer types:
VR1 = 0x0; // variable to aux port 1
VR2 = 0x1; // variable to aux port 2
RD1 = 0x2; // redirect to aux port 1
RD2 = 0x3; // redirect to aux port 2
REF = 0x4; // lazy closed net
ERA = 0x5; // unboxed eraser
NUM = 0x6; // unboxed number
OP2 = 0x7; // numeric operation (binary)
OP1 = 0x8; // numeric operation (unary)
ITE = 0x9; // numeric if-then-else
CT0 = 0xA; // main port of con
CT1 = 0xB; // main port of dup
CT2 = 0xC; // main port of extra node
CT3 = 0xD; // main port of extra node
CT4 = 0xE; // main port of extra node
CT5 = 0xF; // main port of extra node
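To make the 4-bit tag / 28-bit value split concrete, here is a small Rust sketch of how such pointers could be packed and unpacked. The exact bit layout (tag in the low four bits) is our assumption for illustration; the actual runtime may arrange the fields differently:

// Illustrative packing: 4-bit tag in the low bits, 28-bit value above it.
// NOTE: hypothetical layout, shown only to make the tag/value split concrete.
type Ptr = u32;

fn mk_ptr(tag: u32, val: u32) -> Ptr {
    (val << 4) | (tag & 0xF)
}

fn tag(p: Ptr) -> u32 {
    p & 0xF // one of the 16 tags listed above
}

fn val(p: Ptr) -> u32 {
    p >> 4 // 28-bit payload, e.g. a node index
}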
To learn more about the exact implementation details, head to the HVM-Core documentation.
Example
To better understand the potential and operational mechanics of HVM-Core, let's delve into some code derived from the available examples: how a sorting algorithm is structured in HVM, and a simple installation and expression run with HVM-Lang.
Bubble Sort in HVM:
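Below is a sketch in HVM's high-level, rule-based syntax, adapted from the sort examples in the HVM repository (the names Sort, Insert, and SwapGT follow those examples; treat this as illustrative rather than canonical). Each swap-and-insert step is an independent rewrite, which is what lets HVM parallelize it:

// Sort : List -> List
(Sort Nil)         = Nil
(Sort (Cons x xs)) = (Insert x (Sort xs))

// Insert : U60 -> List -> List
(Insert v Nil)         = (Cons v Nil)
(Insert v (Cons x xs)) = (SwapGT (> v x) v x xs)

// SwapGT : if v > x, swap them and keep inserting
(SwapGT 0 v x xs) = (Cons v (Cons x xs))
(SwapGT 1 v x xs) = (Cons x (Insert v xs))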
HVM-Core provides a raw syntax for defining nets, a reference implementation in Rust, and a massively parallel evaluator in CUDA.
In tests, HVM-Core performed up to 6.8 billion interactions per second on an RTX 4090. For evaluating functional programs, it was five times faster than the best alternatives, making it a strong compile target for high-level languages that require massive parallelism.
To harness HVM-Core in Python, one would need a Python module that interfaces with HVM-Core; however, no such module is readily available at the moment. Bindings are expected to emerge in the near future.
HVM-Lang
Work is now underway on HVM-Lang, which serves as an intermediate representation targeting HVM-Core, offering a higher-level syntax for programming based on the Interaction Calculus.
Follow these steps to try it out:
git clone git@github.com:HigherOrderCO/hvm-lang.git
cd hvm-lang
cargo install --path .
This is a basic program that adds two numbers:
main = (+ 3 2)
To run a program using hvm-lang, use the run argument, like below:
hvm-lang run <file>
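For instance, if the addition program above is saved as add.hvm (a filename chosen here for illustration), you would run:

hvm-lang run add.hvm

which reduces main to its normal form, 5.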
The currently available operations are: +, -, *, /, %, ==, !=, <, >, &, |, ^, ~, <<, >>.
Operations can handle only two operands at a time for the moment. Example:
val1 = (+ (+ 10 3) (* 5 6))
Learn more about the syntax and details in the hvm-lang repository.
Underlying Mechanics
HVM-Core functions as a bridge that allows Python and Haskell to access the computational power of GPUs. The workflow starts with HVM-Core serving as a low-level compilation target, where code from high-level languages is compiled into an intermediate representation. This intermediate form is then compiled to CUDA, NVIDIA's parallel computing platform and API model. The key to this process is the Interaction Net format, which is crucial for tapping into the parallel processing capabilities of GPUs. This format manages the organized execution of calculations in parallel, ensuring the GPU's capacity is utilized effectively.
Testing on E2E Cloud
To test out HVM-Core and HVM-Lang, head to E2E Cloud and launch a GPU node. With access to a GPU, you will be able to try some of the basic examples that have started emerging and see the performance improvements for yourself.
Conclusion
HVM-Core bridges high-level languages like Python and Haskell with the computational power of GPUs, simplifying the complex task of leveraging GPU resources and unlocking new computational possibilities.
Expect more developments to unfold on this front in the near future.