Posts on nathan.rs

BERT is just a Single Text Diffusion Step

Mon, 20 Oct 2025 00:00:00 +0000

A while back, Google DeepMind unveiled Gemini Diffusion, an experimental language model that generates text using diffusion. Unlike traditional GPT-style models that generate one word at a time, Gemini Diffusion creates whole blocks of text by refining random noise step-by-step.

I read the paper Large Language Diffusion Models and was surprised to find that discrete language diffusion is just a generalization of masked language modeling (MLM), something we’ve been doing since 2018. The first thought I had was, “can we finetune a BERT-like model to do text generation?” I decided to try a quick proof of concept out of curiosity.

Research Log

Tue, 14 Oct 2025 10:54:27 -0500

I’m doing my master’s thesis on distributed low-communication training. Essentially, how can we train large models efficiently across distributed nodes without being utterly destroyed by network latency and limited bandwidth? Below is some of what I’ve learned and investigated along the way.

Day 3: Current Work on Heterogeneous Workers#

One desirable capability is training across different kinds of hardware. Even within the same generation, NVIDIA B300 GPUs are 50% faster than B200s. Companies like Meta have many clusters that are each homogeneous internally but differ from one another in hardware. Ideally, we could train a model across clusters regardless of the exact underlying hardware.

Running GPT-2 in WebGL: Rediscovering Classic GPGPU Programming

Sat, 24 May 2025 12:20:47 -0700

A few weeks back, I implemented GPT-2 using WebGL and shaders (GitHub Repo), which made the front page of Hacker News (discussion). This is a short write-up of what I learned about old-school general-purpose GPU programming over the course of this project.

The Origins of General-Purpose GPU Programming#

In the early 2000s, NVIDIA introduced programmable shaders with the GeForce 3 (2001) and GeForce FX (2003). Instead of being limited to predetermined transformations and effects of earlier GPUs, developers were now given unprecedented control over the rendering pipeline, enabling much more sophisticated visual effects. These programmable shaders laid the foundation for modern GPU computing.

Mathematical Statistics

Wed, 21 Feb 2024 14:07:21 -0600

My notes on Mark Maxwell’s course, Introduction to Mathematical Statistics, and his textbook, Probability & Statistics with Applications, Second Edition.

Sampling Distributions and Estimation#

Normally in a probability experiment, we don’t know the true values of a model’s parameters, and therefore, we must estimate them using random observations. Because the observations are random, our estimates are subject to the vagaries of chance. We find ourselves in a paradoxical situation in which the parameters are fixed, but unknown, while the estimates are random, but observable.

Common Probability Distributions

Thu, 08 Feb 2024 12:29:32 -0600

An overview of common discrete and continuous distributions found in probability and statistics, from Mark Maxwell’s textbook, Probability & Statistics with Applications, Second Edition.

Common Discrete Distributions#

Discrete Uniform#

A random variable $X$ is said to have a discrete uniform distribution if its probability function is:

$$Pr(X=x)=\frac{1}{n}$$

for $x=1,2,\dots,n$.

Main Properties#

Expected Value: $$E[X]=\frac{n+1}{2}$$
Variance: $$Var[X]=\frac{n^2-1}{12}$$

Additional Properties#

Median: Same as Expected Value
Mode: None
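
As a quick sanity check on the mean and variance formulas, here is a minimal sketch that computes both moments by direct summation over $x=1,2,\dots,n$ using exact fractions. The function name `discrete_uniform_moments` and the choice $n=6$ (a fair die) are illustrative assumptions, not part of the textbook.

```python
from fractions import Fraction


def discrete_uniform_moments(n):
    """Return (mean, variance) of X uniform on {1, ..., n} by direct summation."""
    p = Fraction(1, n)  # Pr(X = x) = 1/n for every x
    mean = sum(x * p for x in range(1, n + 1))
    variance = sum((x - mean) ** 2 * p for x in range(1, n + 1))
    return mean, variance


# Illustrative case: a fair six-sided die (n = 6).
mean, variance = discrete_uniform_moments(6)
print(mean, variance)
```

For any $n$, the two returned values should match $\frac{n+1}{2}$ and $\frac{n^2-1}{12}$ exactly, since the arithmetic is done with fractions rather than floats.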

Bernoulli#

A Bernoulli trial is an experiment that has exactly two possible outcomes (true-false, girl-boy, success-fail, in-out, etc.).

How to Fix Hugo's iOS Code-Block Text-Size Rendering Issue

Sun, 04 Feb 2024 17:23:27 -0600

Lately, I’ve been coming across many blogs that have weird font-size rendering issues for code blocks on iOS. Basically, in a code snippet, the text-size would sometimes be much larger for some lines than others.

Below is a screenshot of the issue from a website where I’ve seen this occur.

As you can see, the text-size isn’t uniform across code block lines. I’ve seen this issue across many blogs that compile markdown files to HTML such as sites built using Hugo, Jekyll, or even custom md-to-html shell scripts.

Intro to Autograd Engines: Karpathy's Micrograd in Go

Sat, 11 Nov 2023 08:57:53 -0600

For a while, I wanted to build a complete autograd engine. What is an autograd engine, you might ask? To find the answer, we first must know what a neural network is.

Neural Network Crash Course#

A neural network can be seen as a black-box function. We pass an input into this black box and receive an output. Normally, in a function, we define the rules for how to manipulate the input to get an output. For example, if we want a function that doubles the input, i.e., $f(x) = 2x$, then all we would write is:

Where Rust Shines: Algebraic Types and Match Statements

Sat, 11 Nov 2023 08:51:57 -0600

Recently I was going through Thorsten Ball’s “Writing An Interpreter in Go”. In this book, you create a basic interpreted language and write a lexer, parser, evaluator, and REPL for it.

A lexer takes in source code and turns it into an intermediate representation, usually in the form of a stream of tokens. This is called lexical analysis. A parser then takes this stream of tokens and turns it into an Abstract Syntax Tree, which is then evaluated and run.

Favorite Books

Sat, 14 Oct 2023 10:45:41 -0500

Below are all the books I’ve read since middle school, roughly in order. Those highlighted in blue were those I particularly enjoyed :)

2025 – Age 22#

Willpower - Roy F. Baumeister & John Tierney

I was at a ZFellows event in SF where one of the speakers, Adam Guild, recommended this book. It's been a long time since I've read a self-help book like this. I gave it a shot because I've felt lazy recently and was looking for something to help me pick up my old good habits again. EDIT: I did start waking up early again, there has been a directional shift.

Deng Xiaoping and the Transformation of China - Ezra F. Vogel

Favorite Quotes

Sat, 14 Oct 2023 10:45:41 -0500

Here are a few of my favorite quotes from over the years.

Life#

“I believe that a man should strive for only one thing in life, and that is to have a touch of greatness”
— Félix Martí-Ibáñez

“In Three Words, I Can Sum Up Everything I’ve Learned About Life. It Goes On”
— Robert Frost

“But you see,” said Roark quietly, “I have, let’s say, sixty years to live. Most of that time will be spent working. I’ve chosen the work I want to do. If I find no joy in it, then I’m only condemning myself to sixty years of torture. And I can find the joy only if I do my work in the best way possible to me. But the best is a matter of standards—and I set my own standards. I inherit nothing. I stand at the end of no tradition. I may, perhaps, stand at the beginning of one.”
― Ayn Rand, The Fountainhead

Gradient Descent & Optimizers

Tue, 10 Oct 2023 10:24:32 -0500

These are some of my notes on Qiang Liu’s course, Machine Learning II.

Gradient Descent#

Gradient Descent is a fundamental, first-order iterative optimization algorithm for minimizing a function. It seeks a (local) minimum of the function by repeatedly stepping in the direction of the negative gradient, which is the direction of steepest descent.

Update Rule: The parameters $ \theta $ are updated as follows in each iteration:
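
The textbook form of the update, written here with an assumed learning rate $\eta$ and objective $J$ (symbols chosen for illustration, not taken from the course notes), is

$$\theta \leftarrow \theta - \eta \nabla_\theta J(\theta)$$

A minimal sketch of that loop on a hypothetical one-dimensional objective $J(\theta) = (\theta - 3)^2$ might look like this. The function name, the step count, and the learning rate are all assumptions for the example.

```python
def gradient_descent(grad, theta0, lr=0.1, steps=100):
    """Repeatedly apply theta <- theta - lr * grad(theta) and return the result."""
    theta = theta0
    for _ in range(steps):
        theta = theta - lr * grad(theta)  # step against the gradient
    return theta


# Hypothetical objective J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta_star = gradient_descent(lambda t: 2.0 * (t - 3.0), theta0=0.0)
print(theta_star)  # should end up very close to the minimizer at 3
```

With this objective each step shrinks the distance to the minimizer by a constant factor of $1 - 2\eta$, so a learning rate that is too large (here, above 1) would make the iterates diverge instead.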

Language Modeling: Word Embeddings & Architectures

Sat, 07 Oct 2023 15:54:05 -0500

These are a few of my notes from Eunsol Choi’s NLP class at UT Austin.

Word Embeddings#

Word embeddings are a type of word representation that captures the semantic meaning of words in a continuous vector space. Unlike one-hot encoding, where each word is represented as a binary vector of all zeros except for a single ‘1’, word embeddings capture much richer information, including semantic relationships, word context, and even aspects of syntax.

Neural Networks: RNNs, Seq2Seq, & CNNs

Sat, 07 Oct 2023 14:55:20 -0500

These are a few of my notes from Eunsol Choi’s NLP class at UT Austin.

Recurrent Neural Networks (RNNs)#

Recurrent Neural Networks (RNNs) are a class of artificial neural networks specifically designed to tackle sequence-based problems. Unlike traditional feedforward neural networks, RNNs possess a memory in the form of a hidden state, enabling them to remember and leverage past information when making decisions. This makes them particularly effective for tasks like language modeling, time-series forecasting, and sentiment analysis.

Classifiers: Generative & Discriminative Models

Sat, 07 Oct 2023 14:36:35 -0500

These are a few of my notes from Eunsol Choi’s NLP class at UT Austin.

Generative Models vs. Discriminative Models#

When it comes to classification, models are broadly categorized into Generative Models and Discriminative Models.

Generative Models#

In generative models, we aim to model the joint distribution of the data $ P(x, y) $. These models often assume a particular functional form for both $ P(x|y) $ and $ P(y) $. To classify a new data point, we maximize:

Probability

Sat, 02 Sep 2023 11:14:53 -0500

My notes on Mark Maxwell’s course, Probability I, and his textbook, Probability & Statistics with Applications, Second Edition.

Combinatorial Probability#

The fundamental theorem of counting is also known as the multiplication principle.

Given that there are $N(A)$ outcomes for a first choice, and for each of these there are $N(B)$ outcomes for a second choice, the total number of outcomes for the two combined is $N(A)\cdot N(B)$.
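
The multiplication principle can be checked by brute-force enumeration. The sketch below uses made-up outcome sets (`shirts` and `pants` are hypothetical stand-ins for $A$ and $B$) and lists every combined outcome with `itertools.product`.

```python
from itertools import product

# Hypothetical outcome sets standing in for A and B.
shirts = ["red", "green", "blue"]  # N(A) = 3
pants = ["jeans", "khakis"]        # N(B) = 2

# Every combined outcome is one (shirt, pants) pair.
outfits = list(product(shirts, pants))

print(len(outfits))                               # number of pairs enumerated
print(len(outfits) == len(shirts) * len(pants))   # agrees with N(A) * N(B)
```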

Example 1

Linear Algebra

Wed, 30 Aug 2023 11:10:18 -0500

These are my notes from reviewing Linear Algebra, going through Gilbert Strang’s Introduction To Linear Algebra.

Introduction to Vectors#

The core of linear algebra is vector addition and scalar multiplication. Combining these two operations gives us linear combinations.

$$ c\mathbf{v} + d\mathbf{w} = c\begin{bmatrix} 1 \\ 2 \end{bmatrix} + d\begin{bmatrix} 3 \\ 4 \end{bmatrix} = \begin{bmatrix} c + 3d \\ 2c + 4d \end{bmatrix}. $$
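
To make the formula concrete, here is a small sketch using plain Python lists. The helper name `linear_combination` and the scalars $c = 2$, $d = -1$ are arbitrary choices for illustration.

```python
def linear_combination(c, v, d, w):
    """Return c*v + d*w for two vectors of equal length, component by component."""
    return [c * vi + d * wi for vi, wi in zip(v, w)]


v = [1, 2]
w = [3, 4]

# With c = 2 and d = -1 the formula gives [c + 3d, 2c + 4d] = [-1, 0].
result = linear_combination(2, v, -1, w)
print(result)
```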

Rust Front-End Development with Dioxus

Wed, 16 Aug 2023 00:00:00 +0000

October 14th, 2025: This post is old and most likely outdated if you’re reading this! Dioxus may have changed substantially, so do not treat this as a how-to guide.

Why Rust for Front-End Development#

I’ve been using React and Next.js for front-end development ever since high school; it was one of the first things I learned when it came to programming. Recently, I’ve had the itch to learn something new, specifically Rust front-end. As someone with a “.rs” domain, it felt like an inevitable fate. Finally, I can say I put the “.rs” in “nathan.rs”.

Basic Calculus

Fri, 06 Jan 2023 11:19:22 -0600

A small review of Calculus 1, 2, and 3, based on the textbook, Calculus: Early Transcendentals (Eighth Edition).

Differentiation Rules#

Product Rule#

If $f$ and $g$ are both differentiable, then

$$\frac{d}{dx}[f(x)g(x)]=f(x)g^\prime(x)+g(x)f^\prime(x)$$

Quotient Rule#

If $f$ and $g$ are differentiable, then

$$\frac{d}{dx}\bigg[\frac{f(x)}{g(x)}\bigg]=\frac{g(x)f^\prime(x)-f(x)g^\prime(x)}{[g(x)]^2}$$
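
Both rules can be spot-checked numerically. The sketch below compares each formula against a central finite-difference estimate of the derivative. The sample functions ($f = \sin$, $g(x) = x^2 + 1$), the point $x = 0.7$, and the step size `h` are arbitrary assumptions for the check.

```python
import math


def derivative(func, x, h=1e-6):
    """Central finite-difference estimate of func'(x)."""
    return (func(x + h) - func(x - h)) / (2 * h)


# Sample differentiable functions with known derivatives (g is never zero).
f, f_prime = math.sin, math.cos
g = lambda x: x ** 2 + 1
g_prime = lambda x: 2 * x

x = 0.7

# Product rule: (f g)' = f g' + g f'
product_numeric = derivative(lambda t: f(t) * g(t), x)
product_rule = f(x) * g_prime(x) + g(x) * f_prime(x)

# Quotient rule: (f / g)' = (g f' - f g') / g^2
quotient_numeric = derivative(lambda t: f(t) / g(t), x)
quotient_rule = (g(x) * f_prime(x) - f(x) * g_prime(x)) / g(x) ** 2

print(abs(product_numeric - product_rule))    # should be tiny
print(abs(quotient_numeric - quotient_rule))  # should be tiny
```

The finite-difference values are only approximations, so the two sides agree up to a small numerical error rather than exactly.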

Integration#

The Substitution Rule#

If an integrand contains both a function $u=g(x)$ and its derivative $g^\prime(x)$, you can use u-substitution: $$\int f(g(x))\,g^\prime(x)\,dx = \int f(u)\,du$$
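
As a worked example of u-substitution (the integrand is chosen here purely for illustration), take $u=x^2$, so that $du=2x\,dx$:

$$\int 2x\cos(x^2)\,dx = \int \cos(u)\,du = \sin(u)+C = \sin(x^2)+C$$

Differentiating $\sin(x^2)+C$ with the chain rule gives back $2x\cos(x^2)$, which confirms the result.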

This Mountain We Climb

Wed, 03 Mar 2021 00:00:00 +0000

This is a poem that I wrote my senior year of high school in AP Literature.

Here we all are, this mountain we climb,
the sure ascent, that lasts a lifetime,
at the golden summit, a goal we all seek
the meaning of life, at its Godly peak.

Up we should go, a noble direction.
Yet why do so many, rebel in rejection
Up is worthwhile, this mountain we climb,
at the apex is all that’s sublime.