#### A HIGH-PERFORMANCE OBLIVIOUS RAM CONTROLLER ON THE CONVEY HC-2EX HETEROGENEOUS COMPUTING PLATFORM

BASED ON "PHANTOM: PRACTICAL OBLIVIOUS COMPUTATION IN A SECURE PROCESSOR" FROM CCS-2013

**Martin Maas**, Eric Love, Emil Stefanov, Mohit Tiwari Elaine Shi, Krste Asanovic, John Kubiatowicz, Dawn Song





#### A HIGH-PERFORMANCE OBLIVIOUS RAM CONTROLLER ON THE CONVEY HC-2FX HETEROGENEOUS COMPUTING PLATFORM **Maxtin Maas**, Eric Love, Emil Stefanov, Mohit Tiwari Elaine Shi, Krste Asanovic, John Kubiatowicz, Dawn Song BASED ON "**PHANTOM**: PRACTIOLL OBLIVIOUS COMPUTATION IN A SECU<u>RE PROCESSOR" FROM CCS-2013</u> Cryptographic Construct High-performance, FPGA-based platform Secure Processor

# Organizations move to the cloud



E.g. government, financial/medical companies Raises privacy concerns for sensitive data

### Attackers with Physical Access



**Malicious** 



Malicious Intruders Government<br>Employees Intruders Surveillance **Surveillance** 

### Physical Attack Vectors





E.g. replace DRAM DIMMs with NVDIMMs that have non-volatile storage to record accesses

# Computation on Encrypted Data



e.g. Secure Processors (AEGIS, XOM, AISE-BMT), IBM Cryptographic Coprocessors, Intel SGX

# Memory Address Leakage



Leaks e.g. transactions, subjects of surveillance/ audit, geolocations, OS fingerprints, crypto keys

### A real-world example: SQLite



### We want to prevent this information leakage

In the context of a secure processor

Oblivious RAM (ORAM) • Problem investigated since 1987 • Originally for memory accesses of a processor, later for e.g. FSs, DBs,... • Algorithms required MBs of trusted storage or complex (x Hardware)

# Path ORAM (CCS'13, Best Paper)

#### New algorithm by Stefanov et al.

 $\vee$  Low trusted storage requirement  $\vee$  Simple enough to implement in hardware on a secure processor

#### Where's the problem?

How hard can it be to put Path ORAM into a processor?

### 1. ORAM Microarchitecture

- Prior work algorithmic, ignores ORAM microarchitectural implementation
- ORAM needs to fully utilize resources
	- You need to build it to find the details not apparent from the algorithm

# 2. Practicality on real system

- Want obliviousness for real systems
- Custom chips (ASICs) very expensive unless widely adopted
- There is a trend towards FPGA-based accelerators (programmable H/W)

#### PHANTOM: A Practical Oblivious Computing Platform Featuring an ORAM microarchitecture

implemented on an FPGA platform

# **Overview**

- 1. Overview & Attack Model
- 2. Path Oblivious RAM
- 3. The Oblivious Memory System
- 4. Building PHANTOM
- 5. Evaluation

#### **PART I**  Attack Model and Deployment

#### PHANTOM Overview

![](_page_17_Figure_1.jpeg)

![](_page_18_Figure_0.jpeg)

#### **PART II**  Path Oblivious RAM

### Path Oblivious RAM

#### Oblivious memory is divided into blocks:

![](_page_20_Figure_2.jpeg)

#### When accessing a block through Path ORAM

![](_page_20_Figure_4.jpeg)

#### 1 0 Path Oblivious RAM Position Map (secure)

![](_page_21_Picture_117.jpeg)

![](_page_21_Figure_2.jpeg)

B F A E DI Stash (secure)

### Required Stash Size • Blocks stay behind in the stash

- How large does the stash have to be to never overflow?
	- Bound known up to constant factors: determined constants empirically

#### **PART III**  The Oblivious Memory System

# The Oblivious Memory System

![](_page_24_Figure_1.jpeg)

# High-throughput Memory

![](_page_25_Figure_1.jpeg)

# High-throughput Memory

![](_page_26_Figure_1.jpeg)

# Writing Back Blocks

![](_page_27_Picture_65.jpeg)

![](_page_27_Picture_66.jpeg)

For each node in the path, select an entry from the stash to write to it (or put a dummy).

# Time to pick a block

- In our case, we have 32 cycles to pick the next block (otherwise we will stall the memory system).
- Examining all blocks takes C cycles<br>for each block.<br>for each block. for each block.

# Picking from the full stash

![](_page_29_Figure_1.jpeg)

# Adding a sorting step

![](_page_30_Figure_1.jpeg)

# Heap-based Sorting

![](_page_31_Figure_1.jpeg)

# Timing Channels

#### Operation is data-driven; risk to leak information from timing

- 1. Operation always take the maximum amount of time (avoiding large overheads) or are overlapped
- 2. Decouple DRAM timing variations

Challenge: Side Channels

#### DRAM Buffer

#### Absorb timing variations at periphery

![](_page_33_Figure_2.jpeg)

#### The Whole Picture

![](_page_34_Figure_1.jpeg)

#### **PART IV**  Building PHANTOM

# PHANTOM Prototype

![](_page_36_Figure_1.jpeg)

Implemented on Convey HC-2ex platform

# Integrated with RISC-V CPU

![](_page_37_Figure_1.jpeg)

#### Developed by UC Berkeley's Architecture Group

### PHANTOM Secure Processor

- Integrated a RISC-V CPU with ORAM
- Loads and runs real-world programs, including (in-memory) SQLite
- Not optimized for FPGA yet, very small cache sizes (4KB/4KB/8KB)

## Implementation on the HC-2ex

- Use Convey development kit, bundles Convey and user logic into personality
- Implement Verilog module, interfaces with MCs, management unit, etc.
	- Personality loaded by Convey runtime

# Convey Personality Workflow

![](_page_40_Figure_1.jpeg)

Using this to build two-way communication channel

### Interaction with RISC-V CPU

![](_page_41_Figure_1.jpeg)

RISC-V CPU runs independently but talks to host

### ORAM Microarchitecture

- Fully implemented, except remote attestation and AES units
- ORAM controller tested/verified for millions of random ORAM accesses • ORAM Block Size of 4KB (for now)

# Implementation Challenges

- Many challenges and unknown details
- Min-heap, BRAM multiplexing, block headers, stash management, block caching, timing domains, inter-FPGA communication, block buffering,...

# Min-heap Implementation

#### Need to write and look at two children at every step, running at 150 Mhz

Pre-fetch four grandchildren to avoid long combinational path (read and write to BRAM in the same cycle)

Split into multiple BRAMs to avoid limitation to 2 ports

# Synthesized FPGA Design

![](_page_45_Picture_1.jpeg)

either with a block arriving from or being written to mem-

**PART V**  Evaluation

#### ORAM Access times Total access time 40! 4719 cycles Time per ORAM access (us) Time per ORAM access (us) 35! 4352 30! cycles 25! 20! 32x15! 10! 5! 0.812us 0!  $0$  1 2 3  $0$  1 2 3  $0$  1 2 3  $0$  1 2 3  $0$  1 2 3  $0$  1 2 3  $0$  1  $\sqrt{2}$  3 13 14 15 16 17 18 19 Row 1: ORAM size in levels (64MB-4GB) - Row 2: # cached levels (k) Time for read phase Average of 1M ORAM accesses each (4KB)

## Application Performance

![](_page_48_Figure_1.jpeg)

#### 20%-5.5x Overhead (1MB LLC)

#### 1GB ORAM

#### Future Work

- Prototype is a starting point
	- Integrate additional Path ORAM optimizations, HW/SW co-design
		- Compiler/OS support to avoid ORAM accesses and reduce size of ORAMs

#### Conclusion

• Investigated ORAM microarchitecture to exploit high memory bandwidth • PHANTOM: Make oblivious computation practical on existing hardware

#### Thank you! Any Questions?

![](_page_51_Picture_1.jpeg)

![](_page_51_Picture_2.jpeg)

![](_page_51_Picture_3.jpeg)

![](_page_51_Picture_4.jpeg)

#### **Martin Maas**, Eric Love, Emil Stefanov, Mohit Tiwari Elaine Shi, Krste Asanovic, John Kubiatowicz, Dawn Song

{maas, ericlove, emil}@eecs.berkeley.edu, tiwari@austin.utexas.edu, elaine@cs.umd.edu, {krste, kubitron, dawnsong}@eecs.berkeley.edu