I am a recent computer science and engineering graduate from the University of Michigan, interested in computer science theory,
low-level software, and computer architecture. I am focused on developing skills at the intersection of those areas,
particularly high-performance computing and system architecture.
During my time at the University of Michigan, I was a member of the M-Fly
autonomous vehicle engineering project team, a TA for an upper-level CS
course on compiler design, and a recipient of the engineering scholarship
of honor. A list of relevant coursework is available on my LinkedIn.
I worked with Professor Max New during the Winter 2025 semester, providing office hours, responses to student questions on Piazza,
a weekly discussion section, exam proctoring/grading, and general support for course logistics.
Analog Devices
Hardware & Systems Engineering Intern
May - August 2024
During this internship, I was tasked with automating an existing workflow for a Neural Processing Unit's SDK for
PPA (power, performance, and area) analysis. This involved modifying memory map tables, generating linker scripts,
compiling TensorFlow Lite neural networks into binary executables for the hardware, running SystemC simulations,
scraping simulation logs for performance data, generating/simulating parameterizable RTL, and performing power
usage analysis. I also profiled and optimized an existing CUDA and C++ codebase for 5G signal processing. The pipeline involved
downconversion, OFDM via FFTs, channel estimation, demodulation, and EVM calculations on Nvidia GPUs.
The GPU kernels were profiled with a mixture of Nsight Systems, Nsight Compute, roofline analysis, and
Python scripting.
Both tasks were fairly high-visibility and led to significantly improved workflows and data harvesting.
Analog Devices
System Software Engineering Intern
May - August 2023
During this internship, I worked on the product security team and developed a SystemC model of an elliptic curve
cryptographic block in an embedded security enclave to aid in its functional verification and NIST certification.
The model was designed with transaction level granularity.
4. Projects
Below you will find various notable technical projects I have worked on.
Sharded Paxos-based key/value service
Academic project
December 2024
I implemented a distributed sharded key/value storage system with fault tolerance and consistency using Paxos.
The system partitions keys across multiple replica groups, each responsible for a subset of shards, with a central
"shard master" managing shard assignments and reconfigurations. I developed the shard master to handle configuration
changes using Join(), Leave(), Move(), and Query() RPCs exposed to the client,
ensuring even shard redistribution while maintaining fault tolerance. Later on, I built the sharded key/value
servers to handle client requests with single-copy semantics, designing a robust protocol for shard transfers during
configuration changes. This system successfully manages dynamic reconfigurations and ensures
consistency across all replica groups.
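As a rough illustration of the "even shard redistribution" idea (not the project's actual code), here is a minimal Python sketch. The function name `rebalance`, the dictionary layout, and the tie-breaking rules are assumptions made for this example. It keeps shards where they are whenever possible and moves only orphaned or surplus shards.

```python
from collections import defaultdict

def rebalance(assignment, groups):
    """Return a new shard -> group map spread evenly over `groups`.

    `assignment` maps shard id -> current group id (hypothetical layout).
    Only orphaned shards (owner left) and surplus shards are moved.
    """
    groups = sorted(groups)
    if not groups:
        return {shard: None for shard in assignment}

    owned = defaultdict(list)   # group id -> shards it keeps for now
    orphans = []                # shards whose owner is no longer a member
    for shard in sorted(assignment):
        gid = assignment[shard]
        if gid in groups:
            owned[gid].append(shard)
        else:
            orphans.append(shard)

    # Every group gets `base` shards; `extra` groups get one more.
    base, extra = divmod(len(assignment), len(groups))
    # Give the "+1" quotas to the groups that already hold the most shards.
    by_load = sorted(groups, key=lambda g: (-len(owned[g]), g))
    quota = {g: base + (1 if i < extra else 0) for i, g in enumerate(by_load)}

    # Take surplus shards away from over-quota groups...
    for g in groups:
        while len(owned[g]) > quota[g]:
            orphans.append(owned[g].pop())
    # ...and hand them to under-quota groups.
    for g in groups:
        while len(owned[g]) < quota[g]:
            owned[g].append(orphans.pop())

    return {shard: g for g in groups for shard in owned[g]}
```

For example, calling `rebalance` after a Join() with the enlarged group list, or after a Leave() with the reduced one, yields the next configuration. Because every choice is deterministic (sorted ids, fixed tie-breaks), each replica of the shard master would compute the same result from the same inputs.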
Multiprocessor cache coherence protocol
Academic project
November 2024
I designed and formally verified a directory-based cache coherence protocol for a 4-core processor.
The protocol was a modification of the canonical MSI scheme found in many computer architecture courses. Modifications
mainly included additional handshakes and transient states to deal with the fact that messages could be arbitrarily
reordered by the interconnect network. Formal verification was done by encoding the replicated state machine of
the processor and directory cache controllers in the Murphi language
and exploring the reachable state space of roughly 700,000 states for invariant violations and deadlocks.
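To give a flavor of what explicit-state exploration looks like, here is a minimal Python sketch of a breadth-first checker. It is not the Murphi model from the project. The toy transition system below uses atomic MSI transitions with no transient states and no message reordering, and the names `successors`, `swmr`, and `check` are assumptions made for this example.

```python
from collections import deque

def successors(state):
    """Yield next states of a toy MSI system (one letter per cache)."""
    n = len(state)
    for i in range(n):
        # Read miss: requester goes to S; a current M owner is downgraded to S.
        if state[i] == "I":
            yield tuple("S" if j == i or s == "M" else s
                        for j, s in enumerate(state))
        # Write: requester goes to M; every other copy is invalidated.
        if state[i] != "M":
            yield tuple("M" if j == i else "I" for j in range(n))
        # Eviction: requester drops its copy.
        if state[i] != "I":
            yield tuple("I" if j == i else s for j, s in enumerate(state))

def swmr(state):
    """Single-writer/multiple-reader invariant."""
    writers = state.count("M")
    readers = state.count("S")
    return writers == 0 or (writers == 1 and readers == 0)

def check(initial, successors, invariant):
    """Breadth-first search over all reachable states."""
    seen = {initial}
    frontier = deque([initial])
    while frontier:
        state = frontier.popleft()
        if not invariant(state):
            return ("invariant violated", state, len(seen))
        nexts = list(successors(state))
        if not nexts:
            # No enabled transition: report the state as a deadlock.
            return ("deadlock", state, len(seen))
        for nxt in nexts:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return ("ok", None, len(seen))

result = check(("I",) * 4, successors, swmr)
```

With four caches this toy model has only 20 reachable states. The real protocol's state space is far larger because transient states and in-flight messages are part of each state.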
Paxos-based key/value service
Academic project
November 2024
I implemented a Paxos-based key/value storage system that ensures high reliability and fault
tolerance. The system replaces a single master view server with Paxos to manage consensus, enabling all replicas
to process client requests in a consistent order without relying on a single point of failure. My work involved
designing and implementing a Paxos library that supports concurrent agreement across multiple instances,
managing memory efficiently for forgotten instances, and maintaining linearizable semantics for the key/value
service. The project also required careful handling of duplicate requests and ensuring the system could
recover state when replicas lagged behind. By layering the Paxos library, a replicated state machine,
and the key/value server, I structured the implementation to separate concerns and simplify complexity.
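The duplicate-request handling mentioned above can be sketched as follows. This is an illustrative Python sketch, not the project's implementation. It assumes each request carries a client id and a per-client sequence number, and that the Paxos layer has already produced a single agreed order of operations (the `log` argument). The tuple layout and the name `apply_log` are hypothetical.

```python
def apply_log(log):
    """Apply decided operations in order, executing each request at most once.

    Each log entry is (client_id, seq, kind, key, value), where `kind` is
    "put", "append", or "get". Returns the final store and one reply per entry.
    """
    store = {}
    last_seen = {}   # client_id -> (highest seq applied, reply to that request)
    replies = []
    for client, seq, kind, key, value in log:
        prev_seq, prev_reply = last_seen.get(client, (-1, None))
        if seq <= prev_seq:
            # Duplicate (e.g. a client retry decided in two Paxos instances):
            # do not re-execute; resend the cached reply if it is the latest.
            replies.append(prev_reply if seq == prev_seq else None)
            continue
        if kind == "put":
            store[key] = value
            reply = ""
        elif kind == "append":
            store[key] = store.get(key, "") + value
            reply = ""
        elif kind == "get":
            reply = store.get(key, "")
        else:
            raise ValueError(f"unknown operation: {kind}")
        last_seen[client] = (seq, reply)
        replies.append(reply)
    return store, replies
```

Because every replica applies the same log in the same order, each replica reaches the same store and the same duplicate table, so a retried request gets the same answer regardless of which replica serves it.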
Computer security exploits
Several academic projects
January 2024
Disclaimer: All computer security projects listed on this page and subsequent exploits were sanctioned
by the University of Michigan. Any infrastructure mentioned as targets of exploits or breaches are owned
and operated by the university for educational purposes.
1) In the first project, I implemented several exploits against known vulnerabilities in hash functions
such as MD5, SHA-1, SHA-2, and others built on the Merkle-Damgård construction.
First, I wrote scripts that perform a length extension attack and a hash collision attack. I also
created scripts that performed padding oracle attacks on a vulnerable endpoint that handled
encrypted communications and whose backend made the mistake of decrypting before authenticating the message.
2) In the second project, I executed several website security exploits such as SQL injection, cross-site scripting (XSS),
and Cross-site request forgery (CSRF) against various tiers of defenses.
3) In the third project, I retraced the steps an attacker took to breach a fictional company's website and database. I had
to employ tools like Wireshark, Python scripting, and networking code to find security vulnerabilities in the company's
mobile device management (MDM) infrastructure, particularly in their DNS resolver and their multi-factor authentication tool.
4) In the fourth project, I performed several exploits against application security vulnerabilities, focusing on buffer
overflows and control-flow hijacking. First, I developed input that triggered stack variable overwrites in a controlled
environment, demonstrating the risks of improper memory management. Then, I crafted payloads to overwrite return addresses,
redirecting program execution to arbitrary code. Using tools like GDB and x86-64 assembly, I created exploits that bypassed
security defenses like DEP (Data Execution Prevention) and ASLR (Address Space Layout Randomization). Finally, I applied
reverse engineering techniques with Ghidra to analyze a closed-source binary, identifying and exploiting vulnerabilities
to achieve the desired control. These tasks enhanced my understanding of machine architecture, assembly language, and
the importance of secure coding practices.
5) In the final project, I was tasked with performing a forensic analysis on a fictional cyber criminal named Leslie. I
was provided with a forensic copy of his hard drive as well as his physical machine. In order to find incriminating evidence,
I had to use many of the techniques from previous projects along with new ones such as steganography,
password crackers, binwalk, spectrograms, and the Autopsy digital forensics software.
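Returning to the length extension attack from the first project: the key step is reconstructing the padding that a Merkle-Damgård hash appends internally. Below is a minimal Python sketch for MD5 only. The helper names `md5_padding` and `forged_message` are assumptions made for this example, and resuming the hash from a known digest (which needs a custom MD5 implementation) is deliberately left out.

```python
import struct

def md5_padding(message_len):
    """Return the padding MD5 appends to a message of `message_len` bytes.

    MD5 pads with 0x80, then zero bytes until the length is 56 mod 64,
    then the original length in bits as a 64-bit little-endian integer.
    """
    zeros = (55 - message_len) % 64
    bit_len = (message_len * 8) % 2**64
    return b"\x80" + b"\x00" * zeros + struct.pack("<Q", bit_len)

def forged_message(known_message, secret_len, suffix):
    """Build the message whose hash an attacker can compute without the secret.

    Assumes the server hashes secret || message, and that the attacker knows
    (or guesses) `secret_len`. The result is what the attacker submits in
    place of `known_message`.
    """
    glue = md5_padding(secret_len + len(known_message))
    return known_message + glue + suffix
```

SHA-1 and SHA-2 use the same scheme but encode the length big-endian (and SHA-512 uses 128-byte blocks with a 128-bit length field), so the sketch would need small changes for those hashes.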
For this project, my team and I designed and implemented a 32-bit RISC-V superscalar out-of-order processor in behavioral SystemVerilog.
Simulation and synthesis were performed via Synopsys VCS. Our design was inspired by the MIPS R10000 implementation of Tomasulo's algorithm and
included core components like a reservation station (RS), reorder buffer (ROB), physical register file (PRF), and a retirement
register allocation table (RRAT).
We chose a 3-way superscalar configuration to balance complexity and performance, incorporating
advanced features such as early tag broadcasting (ETB), a non-speculative load-store queue with internal data forwarding from in-flight stores to dependent loads,
and a non-blocking L1 data cache with prefetching. ETB enabled dependent instructions to execute back-to-back with minimal delays, while the load-store queue
effectively reduced memory access contention. The load-store queue handled dependency tracking with bit masks. Each load in the RS maintained a bit mask that
indicated which stores in the Store Queue (SQ) were older and a second bit mask was used to track which older stores had unresolved addresses. This design was
inspired by similar mechanisms in the Berkeley Out-of-Order Machine.
Our I$ and D$ were designed with prefetching and non-blocking mechanisms, enhancing data throughput.
Memory handling was enhanced by implementing miss status handling registers (MSHRs) to
reduce stalls caused by cache misses. Performance analysis revealed improvements in cycles per instruction (CPI) compared to
baseline in-order designs, particularly on benchmarks leveraging instruction-level parallelism (ILP). However, challenges
like cache aliasing and limited branch prediction accuracy highlighted areas for future improvement. Additionally, a
React-based GUI debugger was developed to visualize all processor signals at every cycle of a program's execution to aid debugging.
We were able to meet timing with a 13.7 ns clock period (~73 MHz frequency).
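The bit-mask dependency tracking in the load-store queue can be illustrated with a short Python sketch. This is a software model for explanation only, not the project's SystemVerilog. The function names, the use of one bit per Store Queue entry, and the separate queue-wide "unresolved" mask are assumptions made for this example.

```python
def load_can_issue(older_stores, unresolved_stores):
    """A non-speculative load may issue once no older store has an unknown address.

    Both arguments are integers used as bit masks: bit i refers to SQ entry i.
    `older_stores` marks the stores that precede the load in program order;
    `unresolved_stores` marks the stores whose addresses are not yet computed.
    """
    return (older_stores & unresolved_stores) == 0

def forwarding_candidates(older_stores, store_addrs, load_addr):
    """Return a mask of older stores whose resolved address matches the load.

    `store_addrs[i]` is the address of SQ entry i, or None if unresolved.
    Choosing the youngest matching store is left out of this sketch.
    """
    mask = 0
    for idx, addr in enumerate(store_addrs):
        if addr == load_addr and (older_stores >> idx) & 1:
            mask |= 1 << idx
    return mask

# A load that is younger than SQ entries 0, 1, and 2, but older than entry 3.
older = 0b0111
blocked = not load_can_issue(older, unresolved_stores=0b0100)  # entry 2 unknown
ready = load_can_issue(older, unresolved_stores=0b1000)        # only a younger store unknown
hits = forwarding_candidates(older, [0x10, 0x20, 0x10, 0x10], 0x10)
```

In hardware the same check is a bitwise AND followed by a reduction OR, which is one reason a mask representation is convenient.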
C to x86-64 optimizing compiler
Several academic projects
January - March 2024
Over the course of several projects, I iteratively implemented an optimizing compiler from
a sizeable subset of the C language (which we dubbed Oat) to x86-64 machine code.
The compiler was implemented entirely in OCaml and followed the AMD64 System V ABI calling conventions. It was written in the following phases.
Phase 1:
Implemented an assembler and simulator for a small, idealized subset of the x86-64 platform
that would serve as the target language for the compiler.
Phase 2:
Implemented a non-optimizing compiler for a subset of the LLVM IR language (dubbed LLVMlite) with x86-64 as the target.
At this point, the compiler's backend was largely completed.
Phase 3:
Implemented a non-optimizing compiler for Oat with LLVMlite as the target. At this point,
the compiler's frontend was largely completed and it supported compiling simple Oat programs.
[Oatv1 rules]
Phase 4:
Implemented new Oat language features
such as structs, function pointers, a distinction between possibly-null and definitely-not-null references,
and array initializers, and updated the type system to support all of these additions.
[Oatv2 rules]
Phase 5:
Implemented compiler optimizations at the LLVMlite IR
level in the backend. These included dataflow analysis, dead code elimination, constant propagation, and a proper
register allocation heuristic instead of placing all variables and intermediate values on the stack. For register
allocation, I chose to implement Chaitin's algorithm
with conservative node coalescing.
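For readers unfamiliar with graph-coloring register allocation, here is a minimal Python sketch of the simplify/select core of a Chaitin-style allocator. It is not the OCaml code from the project. It omits coalescing, liveness analysis, and spill-code generation, and the function name `color_graph` and the spill heuristic (highest degree) are assumptions made for this example.

```python
def color_graph(interference, k):
    """Assign one of `k` registers (0..k-1) to each node, or None if spilled.

    `interference` maps each virtual register to the set of registers that
    are live at the same time; it is assumed to be symmetric.
    """
    graph = {n: set(adj) for n, adj in interference.items()}
    stack = []     # nodes guaranteed to be colorable, in removal order
    spilled = []   # nodes chosen as spill candidates

    while graph:
        # Simplify: a node with fewer than k neighbors can always be colored.
        node = next((n for n in sorted(graph) if len(graph[n]) < k), None)
        if node is None:
            # No such node: pick a spill candidate (here, the highest degree).
            node = max(sorted(graph), key=lambda n: len(graph[n]))
            spilled.append(node)
        else:
            stack.append(node)
        for neighbor in graph.pop(node):
            graph[neighbor].discard(node)

    # Select: pop nodes in reverse order and give each the lowest free color.
    colors = {n: None for n in spilled}
    while stack:
        node = stack.pop()
        used = {colors[nb] for nb in interference[node]
                if colors.get(nb) is not None}
        colors[node] = min(c for c in range(k) if c not in used)
    return colors

example = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
three_regs = color_graph(example, 3)
two_regs = color_graph(example, 2)
```

With three registers every node in `example` is colored. With two registers, the triangle a-b-c cannot be colored, so one node is marked as spilled (`None`) and would live on the stack instead.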
Multithreaded network file server
Academic project
November - December 2023
Implemented an ACID-compliant network file server using C++ and the
Boost libraries for regex, thread, and reader-writer lock functionality.
The file server can be run on any Unix machine, utilizes BSD sockets for interprocess communication,
and has several design considerations for fault tolerance.
CNN forwarding layers optimized w/ Nvidia GPUs
Academic project
November 2023
Given a pretrained convolutional neural network written in CUDA
for classifying MNIST-Fashion clothing images into one of several discrete bins, I optimized the provided forwarding kernel
(it comprised 98.95% of total execution time) by utilizing several GPU optimization techniques. These included placing constant
filter values into the GPU's constant memory (which takes orders of magnitude fewer cycles to access), rewriting memory access patterns
such that they were coalesced and minimized memory bank conflicts, unrolling loops, and rewriting the algorithm to leverage
the massive parallel compute capability of the Tesla V100 datacenter GPU I worked with. In fact, I was able to parallelize
the work for a batch of 10,000 images, the subsequent iterations over all output feature maps, and the iterations over each
individual pixel. In summary, although the theoretical number of calculations did not change in any of the kernel invocations,
the kernel was rewritten so that the V100 scheduled as much work as possible across its 80 streaming multiprocessors. The final
kernel ran in 0.17s across both passes, whereas the original implementation took 13.78s in total (~81x speedup).
Unix virtual memory pager
Academic project
September - October 2023
Implemented a simulator of the pager portion of a Unix operating system, which manages application
processes' virtual address spaces. The pager was written in C++ and
implemented system calls such as Unix fork(), which are used
to create, copy, and destroy address spaces, allocate more space in existing ones, and switch between address
spaces.
Unix thread library
Academic project
September 2023
Implemented a POSIX-like thread library in C++, enabling thread
creation, synchronization, and context switching on multicore machines. I managed threads using custom thread control blocks
(TCBs) and preemptive scheduling with interrupt safety.