## **Summary**

Strong C++/Python coding. Experience in driving processor implementation through chip prototype. Deep understanding of computer systems and VLSI. Solid grasp of computer science/machine learning fundamentals.

# Experience

## Intel

Austin, Texas

## **Computer Architect (System-on-Chip Performance)**

2017-present

Modeled SoC performance and predicted workload performance at power for server processors.

Developed tools/automation to boost team efficiency/productivity.

Enhanced SoC modeling capabilities, simulator features, multi-simulator integration (C++).

Conducted pre-silicon SoC performance tuning, analysis, and validation – interconnect/memory traffic.

## **Software Engineer (Design Automation)**

2016-2017

Developed flows/methodologies for automatic place/route and timing closure of CPU designs (successful tape-in).

## Qualcomm Research

San Diego, California

Research Intern

Summer 2013

Performed mixed-signal circuit design verification, post-silicon measurements, and FPGA prototyping.

## **Skills**

| Programming                                 | Machine Learning/Tools              | Processor Implementation          |
|---------------------------------------------|-------------------------------------|-----------------------------------|
| Strong in C++, Python, Tcl                  | Strong in Pandas, PyTorch, SKlearn  | SystemVerilog/Verilog, SystemC    |
| Basic Java, Clojure, JavaScript, Perl       | Git, Docker, Spark, Dashboards      | Platform Architect, Simics        |
| Unix shell, HTML, SQL, Node.js              | XGBoost, Regressions, Efficient ML  | Place-Route, DFT, Timing, DRC/LVS |

# **Select Publications/Awards**

A Logic-on-logic 3D-stacked Heterogeneous Multi-core Processor. IEEE ICCD 2017.

Physical Design of a 3D-stacked Heterogeneous Multi-core Processor. IEEE 3D-IC 2016.

Ranked 34<sup>th</sup> in USA, IEEExtreme 24-hour Programming Competition, 2014. Team of 2.

Best FPGA Implementation at International LSI Design Contest, Japan 2009. Xilinx Award. Team of 3.

# Education

## North Carolina State University

Raleigh, North Carolina

3.98/4.0. 2016

## **Ph.D. in Computer Engineering**

Dissertation: Three-Dimensional Integration of Heterogeneous Multi-Core Processors.

Research team built a functional 3D-IC processor chip. Developed custom 3D-IC physical implementation flow. Performed architecture analysis, verification, and entire RTL-GDS2 back-end flow up to deliverable layout.

Teaching Assistant (graduate-level): Design of Digital Systems, Computer Design & Technology.

Coursework: Software Engineering, Advanced Microarchitecture, ASIC Design, Electronic Sys. Level Design, Computer Networks, Parallel Computer Arch., ASIC Verification, Physical Design, Memory Systems, IC Technology & Fabrication, VLSI Systems Design, Computer Design & Tech., Embedded Systems Design, Digital Electronics, Modern Comp. Algebra-AU, VLSI System Testing (Duke U.).

## Duke University

Visiting Scholar: coursework, research collaboration

Durham, North Carolina

## Bandung Institute of Technology

Indonesia

## **B.S. in Electrical Engineering (Computer Engineering track), with distinction**

2009

2013

Thesis: C implementation of a neural network and Kohonen SOM, covering training/inference and floating/fixed point, on a multi-core Parallax microcontroller. TA: Digital Systems, Microprocessor Lab.

## Oita University

Japan

## **Exchange Student, Research & Coursework**

2007-2008

Research: Implemented face follower on a panning camera using neural networks (implemented in C).

# **Project Experience**

## **Machine Learning**

PyTorch: Integrated and analyzed model quantization coupled with the feedback alignment training algorithm (open-source libraries).

Experimented with developing custom learning algorithms (alternatives to back-propagation), e.g., a binarized neural network with a greedy training approach.

Benchmarked quantized and non-quantized MobileNet and SqueezeNet models on Android using TensorFlow Lite.

## Silicon Implementation / Tape-outs

Successful academic tape-out (functional 3D-IC processor chip) of a heterogeneous multi-core processor system with thread migration features at NCSU. Processor implementation has two stacked dies of $5.25 \text{ mm} \times 5.25 \text{ mm}$ on a 130 nm process.

## RTL Design, FPGA Prototyping

Implemented "Sokoban" (moving box puzzle game) on FPGA: coded the game in MIPS assembly by hand (prototyped in C). Wrote MIPS processor RTL from scratch (team effort, 1 GHz clock in a commercial 180 nm process). Wrote the Verilog code to interface with FPGA buttons and render VGA graphics. Created game sprites.

## **Memory Systems**

Modeled and compared the performance of ideal and non-ideal block placement policies for multi-core systems. Cache block placement policy: requestor core cache vs. remote core cache. Analyzed experiment results from running SPEC2K benchmarks in Simics.

## ESL & Physical Design

Performed TLM & ESL modelling of an SoC design consisting of an ARM Cortex core, DRAM model, and AMBA bus. Performed physical design optimizations, signal integrity analysis, power analysis, and timing analysis. Tools: SystemC, Mentor Graphics Vista, Catapult, Python, C++, UML, Encounter, PrimeTime.

## **Parallel Computer Architecture**

Implemented a simulator for the MSI, MESI, and MOESI cache coherence protocols in C++.

Explored cache coherence protocols to reduce off-chip memory accesses.

## Computer Design and Technology

Implemented a generic cache simulator, branch target buffer simulator, and Tomasulo superscalar processor simulator in C++.

Implemented a checkpoint recovery mechanism for a large-fetch-window processor within the SimpleScalar simulator environment in C++.

## **Advanced Microarchitecture**

Implemented and compared thread migration strategies within the SimpleScalar simulator in C++.

## ASIC Verification

Verified an out-of-order superscalar core (FabScalar) for tape-out and found design bugs in the load-store unit and issue queue. Created a reusable SystemVerilog testbench executed in QuestaSim.

## **Digital Electronics**

Designed a low-power Hybrid Latch Flip-flop in an academic 45 nm tech library. Operating clock frequency 4 GHz, power consumption 19.9 µW, setup time 13.5 ps, hold time 86 ps, $t_{DQ}$ of 63.64 ps.

Designed a voltage-mode and current-mode differential transmitter circuit. Tools: HSPICE.

## VLSI Systems Design

Designed a full-custom 3x3 arbiter-crossbar CMOS unit, 2nd best performance and energy\*delay-squared metric out of 27 teams. Customized power delivery network and clock tree design. Created custom standard cell library and top-level integration. Achieved 5.5 GHz clock frequency, 0.19 nW power, with FreePDK45 technology library. Tools: Cadence Virtuoso, HSPICE, Calibre DRC-LFD.

## ASIC Design

Implemented a Viterbi Decoder in RTL Verilog. Optimized throughput and delay per unit area metric by designing a fast floating point unit, using dual port memory, and pipelining.

## **Online Courses**

Startup Engineering (Coursera), Analysis of Algorithms, Scalable Machine Learning (edX).