

Application Acceleration made simple

www.inaccel.com

info@inaccel.com

#### What software developers want





Source: Databricks, Apache Spark Survey 2016, Report

## **DevOps using CPUs, GPUs**



| Company            | Deploy Frequency    | Deploy Lead Time  | Reliability     | Customer<br>Responsiveness |
|--------------------|---------------------|-------------------|-----------------|----------------------------|
| Amazon             | 23,000 / day        | Minutes           | High            | High                       |
| Google             | 5,500 / day         | Minutes           | High            | High                       |
| Netflix            | 500 / day           | Minutes           | High            | High                       |
| Facebook           | 1 / day             | Hours             | High            | High                       |
| Twitter            | 3 / week            | Hours             | High            | High                       |
| Spine II           | 3 / week            | Hours             | High (Clinical) | High                       |
| Typical Enterprise | Once every 9 months | Months / Quarters | Low / Medium    | Low / Medium               |

Source: The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win, the third book by Gene Kim

#### **Deploy FPGAs on cloud**



- > Several steps

- > Prior knowledge on FPGAs
  - >> Bitstream
  - >> Memory management
  - >> Communication
  - >> Challenges: Bitstream version, Firmware, SDK



#### How To Create an Amazon FPGA Image (AFI) From One of The CL Examples: Step-by-Step Guide

#### Fast path to running CL Examples on FPGA Instance

This section is for developers who want to skip the development flow and start running the examples on the FPGA instance. You can skip steps 1 through 3 if you are not interested in the development process. Steps 4 through 5 will show you how to use one of the pre-designed AFI examples. By using the public AFIs, developers can skip the build flow steps and jump to step 4. Public AFIs are available for each example and can be found in the examples.

#### Step 1. Pick one of the examples and start in the example directory

It is recommended that you complete this step-by-step guide using the HDK hello world example. Next, use this same guide to develop using cl\_dram\_dma. When you're ready, copy one of the examples provided and modify the design files, scripts and constraints directory.

\$ cd \$HDK\_DIR/cl/examples/cl\_hello\_world # you can change cl\_hello\_world to cl\_dram\_dma, cl\_uram\_example or cl\_hello\_world\_vhdl

\$ export CL\_DIR=\$(pwd)

Setting up the CL\_DIR environment variable is crucial, as the build scripts rely on that value. Each example follows the recommended directory structure to match the expected structure for HDK simulation and build scripts.
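Because the build scripts depend on this variable, a quick pre-flight check can catch a missing or wrong value before a long build starts. The following is a minimal sketch only: `check_cl_dir` is a hypothetical helper, not part of the AWS HDK, and it assumes nothing about the example layout beyond the `build/scripts` directory referenced in this guide.

```python
import os
from pathlib import Path


def check_cl_dir(env=None):
    """Return the build/scripts directory under CL_DIR, or raise if it is unusable.

    check_cl_dir is a hypothetical helper, not part of the AWS HDK.
    """
    env = os.environ if env is None else env
    cl_dir = env.get("CL_DIR")
    if not cl_dir:
        raise RuntimeError("CL_DIR is not set; run: export CL_DIR=$(pwd)")
    scripts = Path(cl_dir) / "build" / "scripts"
    if not scripts.is_dir():
        raise RuntimeError(f"{scripts} not found; is CL_DIR pointing at a CL example?")
    return scripts
```

Calling `check_cl_dir()` with no arguments reads the real environment; passing a plain dictionary makes the check easy to exercise in isolation.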

#### Step 2. Build the CL

This checklist should be consulted before you start the build process.

NOTE: This step requires you to have Xilinx Vivado Tools and Licenses installed.

\$ vivado -mode batch # Verify Vivado is installed.

Executing the aws\_build\_dcp\_from\_cl.sh script will perform the entire implementation process, converting the CL design into a completed Design Checkpoint that meets the timing and placement constraints of the target FPGA. The output is a tarball file comprising the DCP file and other log/manifest files, formatted as YY\_MM\_DD-hhmm.Developer\_CL.tar. This file would be submitted to AWS to create an AFI. By default the build script will use Clock Group A Recipe A0, which uses a main clock of 125 MHz.

\$ cd \$CL\_DIR/build/scripts

\$ ./aws\_build\_dcp\_from\_cl.sh

In order to use a 250 MHz main clock, the developer can specify the A1 Clock Group A Recipe as in the following example:

\$ cd \$CL\_DIR/build/scripts

\$ ./aws\_build\_dcp\_from\_cl.sh -clock\_recipe\_a A1

Other clock recipes can be specified as well. More details on the Clock Group Recipes Table and how to specify different recipes can be found in the following README.

NOTE: The DCP generation can take up to several hours to complete, hence aws\_build\_dcp\_from\_cl.sh will run the main build process (vivado) within a nohup context. This will allow the build to continue running even if the SSH session is terminated halfway through the run.

To be notified via e-mail when the build completes:

1. Set up notification via SNS:
   - \$ pip install --user --upgrade boto3 # the boto3 package is required by the notify\_via\_sns script
   - \$ export EMAIL=your.email@example.com
   - \$ \$AWS\_FPGA\_REPO\_DIR/shared/bin/scripts/notify\_via\_sns.py
2. Check your e-mail address and confirm subscription.
3. When calling aws\_build\_dcp\_from\_cl.sh, add on the -notify switch.
4. Once your build is complete, an e-mail will be sent to you stating "Your build is done."
5. For each example the known warnings are documented in the warnings.txt file located in the \$CL\_DIR/build/scripts directory of hello world.
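The build options described in this step (the clock recipe and the notify switch) can be collected in a small wrapper. This is a sketch under stated assumptions: `build_command` and `launch_detached` are hypothetical helper names, and the build script already detaches itself as noted above, so `launch_detached` only illustrates the same idea of surviving a dropped SSH session from Python.

```python
import subprocess


def build_command(clock_recipe_a=None, notify=False):
    """Assemble the argv for aws_build_dcp_from_cl.sh from the options in this guide."""
    argv = ["./aws_build_dcp_from_cl.sh"]
    if clock_recipe_a is not None:
        # For example "A1" selects the 250 MHz main clock.
        argv += ["-clock_recipe_a", clock_recipe_a]
    if notify:
        # Ask for an e-mail when the build completes.
        argv.append("-notify")
    return argv


def launch_detached(argv, cwd, log_path="build.log"):
    """Start a long-running command in its own session, logging to a file."""
    with open(log_path, "ab") as log:
        return subprocess.Popen(
            argv,
            cwd=cwd,
            stdout=log,
            stderr=subprocess.STDOUT,
            stdin=subprocess.DEVNULL,
            start_new_session=True,  # detach from the controlling terminal
        )
```

For instance, `build_command("A1", notify=True)` yields the 250 MHz invocation with e-mail notification; passing that list to `launch_detached` with `cwd` set to the scripts directory would start the build.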

#### Step 3. Submit the Design Checkpoint to AWS to Create the AFI

To submit the DCP, create an S3 bucket for submitting the design and upload the tarball file into that bucket. You need to prepare the following information:

1. Name of the logic design (Optional).
2. Generic description of the logic design (Optional).
3. Location of the tarball file object in S3.
4. Location of an S3 directory where AWS would write back logs of the AFI creation.
5. AWS region where the AFI will be created. Use copy-fpga-image API to copy an AFI to a different region.

To upload your tarball file to S3, you can use any of the tools supported by S3.
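The five pieces of information listed above map onto the parameters of the EC2 `CreateFpgaImage` call. The sketch below assembles them into keyword arguments and, separately, shows how they could be passed to boto3's `create_fpga_image`. The helper names, bucket names and keys are illustrative assumptions; the submission function needs boto3 and valid AWS credentials and is not exercised here.

```python
def afi_request(tarball_bucket, tarball_key, logs_bucket, logs_key,
                name=None, description=None):
    """Build keyword arguments for the EC2 CreateFpgaImage call."""
    kwargs = {
        # Location of the tarball file object in S3.
        "InputStorageLocation": {"Bucket": tarball_bucket, "Key": tarball_key},
        # Location where AWS writes back the AFI creation logs.
        "LogsStorageLocation": {"Bucket": logs_bucket, "Key": logs_key},
    }
    if name:
        kwargs["Name"] = name  # optional
    if description:
        kwargs["Description"] = description  # optional
    return kwargs


def submit_afi(region, **kwargs):
    """Submit the request in the given region; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so afi_request works without boto3 installed

    return boto3.client("ec2", region_name=region).create_fpga_image(**kwargs)
```

A call such as `submit_afi("us-east-1", **afi_request("my-bucket", "dcp/design.tar", "my-bucket", "logs/"))` would then create the AFI in that region; all values in this call are placeholders.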

## **Challenges on FPGAs**

> How can I deploy my FPGA accelerator easily?

> How can I scale out my applications to multiple FPGAs (for example, multiple Alveo cards)?

> How can multiple users or applications share my FPGA cluster?





#### **More Challenges**



> How can I scale out my application on-prem and on the cloud?





## From single node to scalable deployment








# Scalable Orchestrator for FPGA clusters





- Seamless invoking from C/C++, Python, Java and Scala. No need for OpenCL
- Automatic configuration and management of the FPGA bitstreams and memory
- Seamless resource management of the FPGA cluster from multiple threads/processes/applications/users
- Fully scalable: Scale-up (multiple FPGAs per node) and Scale-out (multiple FPGA-based servers over Spark)

#### PaaS and SaaS for FPGA clusters

[Figure: layered service stacks for FPGA clusters. Each stack shows Applications, Data, Runtime, Middleware, Operating System, Virtualization/Sharing, and Servers with FPGAs. Labels: InAccel FPGA Orchestrator, Platform as a Service, Software as a Service, InAccel FPGA Repository with accelerators. Audiences: SW developers, HW developers, ML engineers.]

























oneAPI









#### **Bitstream repository**



> FPGA Resource Manager is integrated with a bitstream repository that is used to store FPGA bitstreams

https://store.inaccel.com









## Simple invoking, deployment



```
std::string binaryFile = argv[1];
size_t vector_size_bytes = sizeof(int) * DATA_SIZE;
cl_int err;
cl::Context context;
cl::Kernel krnl_vector_add;
cl::CommandQueue q;
// Allocate Memory in Host Memory
// When creating a buffer with user pointer (CL_MEM_USE_HOST_PTR), under the hood user ptr
// is used if it is properly aligned. When not aligned, runtime has no choice but to create
// its own host side buffer. So it is recommended to use this allocator if user wishes to
// create buffer using CL_MEM_USE_HOST_PTR to align user buffer to page boundary. It will
// ensure that user buffer is used when user creates Buffer/Mem object with CL_MEM_USE_HOST_PTR
std::vector<int, aligned_allocator<int>> source_in1(DATA_SIZE);
std::vector<int, aligned_allocator<int>> source_in2(DATA_SIZE);
std::vector<int, aligned_allocator<int>> source_hw_results(DATA_SIZE);
std::vector<int, aligned_allocator<int>> source_sw_results(DATA_SIZE);
// Create the test data
std::generate(source_in1.begin(), source_in1.end(), std::rand);
std::generate(source_in2.begin(), source_in2.end(), std::rand);
for (int i = 0; i < DATA_SIZE; i++) {
    source_sw_results[i] = source_in1[i] + source_in2[i];
    source_hw_results[i] = 0;
}
// OPENCL HOST CODE AREA START
// get_xil_devices() is a utility API which will find the xilinx
// platforms and will return list of devices connected to Xilinx platform
auto devices = xcl::get_xil_devices();
// read_binary_file() is a utility API which will load the binaryFile
// and will return the pointer to file buffer.
auto fileBuf = xcl::read_binary_file(binaryFile);
cl::Program::Binaries bins{{fileBuf.data(), fileBuf.size()}};
int valid_device = 0;
for (unsigned int i = 0; i < devices.size(); i++) {
    auto device = devices[i];
    // Creating Context and Command Queue for selected Device
    OCL_CHECK(err, context = cl::Context({device}, NULL, NULL, &err));
    OCL_CHECK(err,
              q = cl::CommandQueue(
                  context, {device}, CL_QUEUE_PROFILING_ENABLE, &err));
    std::cout << "Trying to program device[" << i
              << "]: " << device.getInfo<CL_DEVICE_NAME>() << std::endl;
    OCL_CHECK(err,
              cl::Program program(context, {device}, bins, NULL, &err));
    if (err != CL_SUCCESS) {
        std::cout << "Failed to program device[" << i
                  << "] with xclbin file!\n";
    } else {
        std::cout << "Device[" << i << "]: program successful!\n";
        OCL_CHECK(err, krnl_vector_add = cl::Kernel(program, "vadd", &err));
        valid_device++;
        break; // stop at the first device that was programmed successfully
    }
}
```

## No need for OpenCL

- No need to allocate buffers
- No need to specify bitstreams
- No need to program a specific device



```
inaccel::Request add_req {"com.inaccel.math.vector.addition"};
add_req.Arg(a).Arg(b).Arg(c).Arg(size);
inaccel::Coral::Submit(add_req);
```

- Much simpler invoking
- Software-like function invoking
- No need for OpenCL directives
- Same API for C/C++, Java, Python
- Native API
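To make the request/submit pattern concrete without any FPGA or InAccel library installed, here is a toy Python model of it. Everything in it is an illustrative assumption: `Request`, `submit`, and the `ACCELERATORS` registry are stand-ins written for this sketch, not the InAccel API, and the "accelerator" is an ordinary software function.

```python
class Request:
    """Toy stand-in for an accelerator request: a name plus positional arguments."""

    def __init__(self, accelerator):
        self.accelerator = accelerator
        self.args = []

    def arg(self, value):
        self.args.append(value)
        return self  # returning self allows req.arg(a).arg(b) chaining


def vector_addition(a, b, c, size):
    """Software implementation standing in for the FPGA kernel."""
    for i in range(size):
        c[i] = a[i] + b[i]


# Hypothetical registry; a real orchestrator would resolve names to bitstreams.
ACCELERATORS = {"com.inaccel.math.vector.addition": vector_addition}


def submit(request):
    """Look up the accelerator by name and run it with the collected arguments."""
    ACCELERATORS[request.accelerator](*request.args)


a, b = [1, 2, 3], [10, 20, 30]
c = [0] * 3
submit(Request("com.inaccel.math.vector.addition").arg(a).arg(b).arg(c).arg(3))
print(c)  # [11, 22, 33]
```

The caller names a function and passes arguments; which device runs it, and which bitstream is loaded, is decided behind `submit`. That separation is what allows the same calling code to stay unchanged as the cluster changes.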


#### **Keras Deployment on Alveo cards**



> Easy deployment of Keras applications



pip install inaccel-keras

```
import time
from inaccel.keras.applications.resnet50 import ResNet50
from inaccel.keras.preprocessing.image import ImageDataGenerator

model = ResNet50(weights='imagenet')

data = ImageDataGenerator(dtype='int8')
images = data.flow_from_directory('imagenet/', target_size=(224, 224), class_mode=None, batch_size=64)

begin = time.monotonic()
preds = model.predict(images, workers=16)
end = time.monotonic()

print('Duration for', len(preds), 'images: %.3f sec' % (end - begin))
print('FPS: %.3f' % (len(preds) / (end - begin)))
```

2897 fps on U250



https://docs.inaccel.com/project/keras/

## **Graphical monitoring tool**





#### Quantized ResNet50 on multiple Alveo cards



1 Application => 2 Alveo



2 Applications => 1 Alveo



2 Applications => 2 Alveo



### Scaling Keras to 2 Alveo cards



> Same applications => Instant scaling

2870 fps on 1 U250








### Scaling Keras to 2 Alveo cards



> Same applications => Instant scaling

3681 fps on 2x U250

resources: limits: xilinx/u250: 2
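The `resources: limits: xilinx/u250: 2` line above is a flattened Kubernetes resource limit. As a rough sketch, the same fragment can be written as the nested structure a container spec would carry. The helper name, container name and image below are placeholders; only the resource name and count come from the slide.

```python
import json


def fpga_container(name, image, resource, count):
    """Return a container spec fragment that requests `count` devices of `resource`."""
    return {
        "name": name,
        "image": image,
        "resources": {"limits": {resource: count}},
    }


# The name and image are placeholders; only the limit comes from the slide.
container = fpga_container("keras-resnet50", "example/keras:latest", "xilinx/u250", 2)
print(json.dumps(container["resources"]))
```

Changing the count is then the only edit needed to request one card or two, which matches the "same application" point made on this slide.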









## 2 Applications on a single Alveo card



> Same applications => Instant scaling

2886 fps on 1x U250








## 2 Applications scaling to 2 Alveo cards



> Same applications => Instant scaling

4851 fps on 2x U250

| InAccel                    | 1 U250  | 2 U250   |
|----------------------------|---------|----------|
| 1 APP (workers = 16)       | 2870.71 | 3681.413 |
| 2 APPs (workers = 16 + 16) | 2886.45 | 4851.603 |
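The scaling implied by the table can be worked out directly. The short calculation below uses only the four figures above; "scaling efficiency" is defined here, as an assumption of this sketch, as the speedup divided by the number of cards.

```python
# Throughput in frames per second, copied from the table above.
fps = {
    ("1 APP", 1): 2870.71,
    ("1 APP", 2): 3681.413,
    ("2 APPs", 1): 2886.45,
    ("2 APPs", 2): 4851.603,
}


def speedup(apps):
    """Throughput on two cards divided by throughput on one card."""
    return fps[(apps, 2)] / fps[(apps, 1)]


for apps in ("1 APP", "2 APPs"):
    s = speedup(apps)
    print(f"{apps}: {s:.2f}x speedup, {s / 2:.0%} scaling efficiency")
```

This gives about 1.28x for one application and about 1.68x for two, which suggests that a single application with 16 workers does not by itself keep two cards fully busy.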










#### Heterogeneous deployment



- > InAccel FPGA orchestrator is platform agnostic
- > You can deploy your applications to heterogeneous Alveo clusters

| InAccel                           | 1 U250   | 1 U280     | 2 U280   | 1 U250 + 2 U280 |
|-----------------------------------|----------|------------|----------|-----------------|
| **1 APP** (workers = 16)          | 2897.675 | 1003.132   | 1999.395 | 3939.726        |
| **2 APPs** (workers = 16 + 16)    |          |            |          | 4909.022        |





## Zero overhead, Improved Throughput















https://github.com/Xilinx/Applications/tree/master/GZip

#### **Multi-tenant Vitis deployment**



- > Run Vitis from browser
- > Fully compatible with any Vitis library
- Multi-tenant, multiple applications
- > Scalable deployment



https://labs.inaccel.com:8000/



## Jupyter - JupyterHub



- Deploy and run your FPGA-accelerated applications using Jupyter Notebooks
- InAccel manager allows the instant deployment of FPGAs through JupyterHub



#### JupyterHub on FPGAs



- Instant acceleration of Jupyter Notebooks with zero code changes
- Offload the most computationally intensive tasks to FPGA-based servers



#### Vitis on Alveo cluster on a browser





#### Successful Use cases, Integrations





https://docs.inaccel.com/

#### **Auto-scalable deployment**



- > Starting on prem
- > Moving to the cloud
  - >> Automatically
  - Instantly

InAccel Hybrid Heterogeneous Kubernetes Cluster



#### **Auto-scalable FPGA deployment**



#### Setup the Master node

1. Initialize the Kubernetes control-plane. Use the VPN IP that the OpenVPN Access Server has assigned to that node (e.g. 172.27.224.1) as the IP address the API Server will advertise it's listening on.

```
sudo kubeadm init \
   --apiserver-advertise-address=172.27.224.1 \
    --kubernetes-version stable-1.18
```

To make helm and kubectl work for your non-root user, use the commands from the kubeadm init output.

2. Deploy Calico network policy engine for Kubernetes.

```
kubectl apply -f https://docs.projectcalico.org/v3.14/manifests/calico.yaml
```

3. Deploy Cluster Autoscaler for AWS.

```
helm repo add stable https://kubernetes-charts.storage.googleapis.com
helm install cluster-autoscaler stable/cluster-autoscaler \
   --set autoDiscovery.clusterName=InAccel \
    --set awsAccessKeyID=<your-aws-access-key-id> \
    --set awsRegion=us-east-1 \
    --set awsSecretAccessKey=<your-aws-secret-access-key> \
    --set cloudProvider=aws
```

4. Deploy InAccel FPGA Operator.

```
helm repo add inaccel https://setup.inaccel.com/helm
helm install inaccel inaccel/fpga-operator \
    --set license=<your-license> \
    --set nodeSelector.inaccel/fpga=enabled
```

https://docs.inaccel.com/labs/auto-scaling-aws/



https://www.youtube.com/watch?v=CVVyvXY4w5w

#### **Universities**



- > How do you allow multiple students to share the available FPGAs?
- Many universities have a limited number of FPGA cards that they want to share among multiple students.
- InAccel FPGA orchestrator allows multiple students to share one or more FPGAs seamlessly.
- > Students just invoke the function they want to accelerate, and the InAccel FPGA manager serializes and schedules the function calls onto the available FPGA resources.
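The serialization and scheduling described above can be pictured as a shared queue drained by one worker per device. The following is a toy model under stated assumptions, not InAccel's scheduler: `run_shared` is a hypothetical function, devices are just labels, and each "request" is a plain Python callable.

```python
import queue
import threading


def run_shared(devices, requests):
    """Serialize requests onto a fixed set of devices; return (result, device) pairs."""
    pending = queue.Queue()
    for request in requests:
        pending.put(request)
    done, lock = [], threading.Lock()

    def worker(device):
        # Each device takes one request at a time until the queue is empty.
        while True:
            try:
                request = pending.get_nowait()
            except queue.Empty:
                return
            result = request()  # stands in for running the accelerated function
            with lock:
                done.append((result, device))

    threads = [threading.Thread(target=worker, args=(d,)) for d in devices]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done


# Five "students" submit work to two shared FPGAs.
jobs = [lambda i=i: f"student-{i}" for i in range(5)]
results = run_shared(["fpga0", "fpga1"], jobs)
print(sorted(name for name, _ in results))
```

No student picks a device: every request lands in one queue and whichever card is free takes the next one, so adding a card to `devices` increases capacity without any change to the submitted jobs.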



#### **Universities**



- > But the researchers want exclusive access
- > InAccel orchestrator lets you select which FPGA cards are shared among multiple students and which are allocated exclusively to researchers and Ph.D. students (so they can get accurate measurements for their papers).
- > The FPGAs shared among multiple students operate on a best-effort basis (InAccel manager serializes the requested accesses), while the researchers have exclusive access to their FPGAs with zero overhead.



#### InAccel Coral manager - Kubernetes



- > Integrated solution that allows
  - >> Scale Up (1, 2, or 8 FPGAs per server)
  - >> Scale Out to multiple servers



### Serverless deployment



- Integrated framework for serverless deployment
- > Compatible with Kubeless, Knative
- Users only have to upload the images to the S3 bucket; InAccel's FPGA Manager then automatically deploys the cluster of FPGAs, processes the data, and stores the results back in the S3 bucket.
- Users do not have to know anything about the FPGA execution.
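The upload, process, store-back flow above can be sketched as an event handler. This is an illustration only: the event layout follows the S3 event notification format (`Records[].s3.bucket.name` and `Records[].s3.object.key`), while `handle_s3_event`, the dictionary standing in for S3, the `results/` prefix, and the `process` callback are assumptions made for this sketch.

```python
def handle_s3_event(event, storage, process):
    """Process every object named in an S3 event and store the result under results/.

    storage is any dict-like mapping (bucket, key) -> bytes; it stands in for S3.
    process stands in for the FPGA-accelerated function.
    """
    written = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        out_key = f"results/{key}"  # hypothetical output prefix
        storage[(bucket, out_key)] = process(storage[(bucket, key)])
        written.append(out_key)
    return written


store = {("images", "cat.jpg"): b"raw-bytes"}
event = {"Records": [{"s3": {"bucket": {"name": "images"}, "object": {"key": "cat.jpg"}}}]}
print(handle_s3_event(event, store, lambda data: data.upper()))
```

The handler never mentions a device or a bitstream, which mirrors the point that users need no knowledge of the FPGA execution. In a real deployment, writing results into the bucket that triggers the function would need a prefix filter to avoid re-triggering it.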



https://medium.com/@inaccel/fpgas-goes-serverless-on-kubernetes-55c1d39c5e30

#### Test it on-prem or in your browser



#### **On-prem**



#### **Online - Browser**



https://docs.inaccel.com/

https://labs.inaccel.com:8000/

#### InAccel, Inc. Corporate overview



- > Founded in January 2018
- > Registered in Delaware, USA
- > Membership:







Technology Partner





















Application Acceleration, seamlessly

www.inaccel.com

info@inaccel.com

USA:

500 Delaware Ave STE 1, #1960, Wilmington, DE 19801, USA

Europe (Design Center):

Formionos 47, Kesariani 116 33, Athens, Greece