# Irvyn Hall

Software engineer.

I like writing software and enjoy building things.

## Contact

- Email: irvynhall@gmail.com
- Location: Aberdeen, Scotland
- GitHub: https://github.com/1rvyn
- LinkedIn: https://linkedin.com/in/irvynf1
- Resume PDF: https://irvyn.us/resume/irvyn-hall-resume.pdf

## Current Profile

Motivated by hard problems and AI.

Relocation: willing to relocate anywhere.

## Selected Work

- [f1-machine learning](https://irvyn.us/f1): I recently set up a realtime ML project that runs during F1 weekends and publishes JSON _artifacts_ live; I wrote about it here too.
- [Controlteam HQ](https://controlteamhq.com/): A WIP platform that spins up EC2 instances with openclaw preconfigured, letting users control their own assistant.
- [Neal multi-modal](https://neal.irvynhall.workers.dev/): Multi-modal search on the Cloudflare edge; an example of the capabilities of Gemini embedding-2.

## Recent Writing

- [Gen 5 NVMes are fast](https://irvyn.us/writing/gen-5-nvme/index.md), 5 April 2026: EC2 disk speed benchmarks.
- [F1 ML Notes](https://irvyn.us/writing/f1-ml/index.md), 29 March 2026: F1-ml X auto-research.
- [How fast is your CDN](https://irvyn.us/writing/why-this-site-is-small/index.md), 7 March 2026: Static assets are all you need; build less.


---

# Resume: Irvyn Hall

Motivated by hard problems and AI.

## Contact

- Email: irvynhall@gmail.com
- Location: Aberdeen, Scotland
- GitHub: https://github.com/1rvyn
- LinkedIn: https://linkedin.com/in/irvynf1
- Resume PDF: https://irvyn.us/resume/irvyn-hall-resume.pdf
- Relocation: willing to relocate anywhere

## Skills

### Languages

- Go
- Java
- JavaScript
- TypeScript
- Python
- C#
- C++
- HTML
- PowerShell
- SQL

### Cloud

- Cloudflare
- AWS
- Azure
- Google Cloud
- Vercel
- OpenAI

### Systems

- Node
- Svelte
- React
- Next.js
- Docker
- Kubernetes
- PostgreSQL
- Redis
- Prometheus
- Grafana
- Linux
- WebSockets

### Tooling

- Git
- CI/CD
- Jira
- Teams
- Slack
- Gradle

## Experience

### Software Engineer II, Opus 2

March 2025 - Present

Stack: Go, Python, CUDA, AWS EC2, ASR, LLM, Data Science

- Scaled and hardened internal Whisper GPU infrastructure across AZs.
- Improved backend and frontend performance of realtime ASR services so editors can work on transcripts of unlimited size.
- Having fun leading an initiative to add LLM/AI post-processing to our ASR pipeline.

### Software Engineer, SecuriGroup Ltd

August 2023 - March 2025

Stack: Go, Protocol Buffers, GraphQL, Azure, Prometheus

- Implemented self-hosted observability with Prometheus, Grafana, and Bugsnag, cutting triage time.
- Built backend Go APIs across Protocol Buffers, GraphQL, and REST for 25,000+ users across the UK and Ireland.
- Architected and deployed an Azure AI/ML document pipeline for 1,200+ live clients. RAG over insurance documents that made it into production!
- Led DevOps work across Azure, AWS, and internal systems, speeding up Docker builds and deployments.
- Mentored a junior developer in Go and day-to-day engineering practice.

### Software Engineer, Aberdeen Drilling School

May 2023 - July 2023

Stack: C++, Go, PostgreSQL, Redis, OpenAI

- Worked across C++, PowerShell, Go, HTML, JavaScript, PostgreSQL, Redis, CI/CD, and OpenAI APIs.
- Designed integration tests that cut installation and cleanup time by minutes.
- Built automation scripts and a RAG-based internal chatbot over company documentation.

### Lab Assistant, Robert Gordon University

September 2022 - May 2023

Stack: Node, JavaScript, MongoDB, REST

- Helped the dean of our school, Dr John Isaacs, teach Dynamic Web Development classes covering Node, jQuery, WebSockets, Express, Linux, REST APIs, and MongoDB.

### Summer Intern Software Engineer, Leonardo

May 2022 - August 2022

Stack: C#, GraphQL

- Led the refactoring of C# services from REST to GraphQL, reducing over-querying and improving performance.
- Contributed to planning work for moving internal testing projects toward private-cloud architecture.

### QA Engineer, Moonsworth LLC

January 2021 - October 2022

Stack: Python, A/B testing, Analytics, Gaming, Data Science

- Engineered a heuristic packet-based anti-cheat system that detected more than 300,000 cheaters.
- Wrote internal documentation and trained staff on anti-cheat analytics.
- Drove release QA and experimentation work that reduced bug reports by roughly 80%.
- Managed a team of 4 engineers by creating and tracking work through Trello and Jira.

## Education

### Robert Gordon University

- Degree: Bachelor of Science in Computer Science
- Dates: September 2021 - July 2023
- Location: Aberdeen
- Notes: 4.0 GPA/1st Class

### North East Scotland College

- Degree: Diploma of Higher Education in Computer Science
- Dates: September 2019 - August 2021
- Location: Aberdeen
- Notes: 4.0 GPA


---

# F1 Lab

Static F1 page that flips between live race-weekend predictions and predicted-vs-actual reviews after the flag.

## What It Is

- Human page: https://irvyn.us/f1/
- Public artifact base: https://f1-data.irvyn.us
- Current pointer: https://f1-data.irvyn.us/current_pointer.json
- Latest event outlook: https://f1-data.irvyn.us/latest/event_outlook.json

## Purpose

The project publishes live F1 ML prediction artifacts during race weekends so the portfolio can show deeper context than a timing screen: projected finish order, model-vs-track deltas, telemetry-derived evidence, and post-race review artifacts.

## Implementation Summary

- The portfolio page is static HTML, CSS, and JavaScript.
- The live ML pipeline publishes JSON artifacts to Cloudflare R2.
- The browser reads public JSON artifacts from https://f1-data.irvyn.us.
- The live-control side uses Cloudflare Workers, Workflows, Durable Objects, R2, and Containers.
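For anyone wiring this up, a minimal polling sketch (Python; the URLs are the real public endpoints above, but the JSON keys and the polling cadence here are assumptions):

```python
import json
import time
import urllib.request

BASE = "https://f1-data.irvyn.us"

def fetch_json(url: str) -> dict:
    # The artifacts are plain public JSON objects.
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

# current_pointer.json tells clients where the live artifacts are; the
# exact keys inside it are an assumption here, so inspect the payload first.
pointer = fetch_json(f"{BASE}/current_pointer.json")
print("pointer:", pointer)

# Poll the latest event outlook a few times, the way the /f1 page does.
for _ in range(5):
    outlook = fetch_json(f"{BASE}/latest/event_outlook.json")
    print("outlook keys:", sorted(outlook))
    time.sleep(60)
```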


---

# Writing

Posts on fast systems, accurate systems, and tradeoffs.

- [Gen 5 NVMes are fast](https://irvyn.us/writing/gen-5-nvme/index.md), 5 April 2026: EC2 disk speed benchmarks.
- [F1 ML Notes](https://irvyn.us/writing/f1-ml/index.md), 29 March 2026: F1-ml X auto-research.
- [How fast is your CDN](https://irvyn.us/writing/why-this-site-is-small/index.md), 7 March 2026: Static assets are all you need; build less.


---

# Gen 5 NVMes are fast

Date: 5 April 2026

Summary: EC2 disk speed benchmarks.

Canonical HTML: https://irvyn.us/writing/gen-5-nvme/

I wrote a quick benchmark over the weekend to piece the parts together (implementation details don't matter much here): Grafana and a few `.tf` files to spin up and configure the EC2 instances.

https://github.com/1rvyn/irvyn-puffer

It's quite self-explanatory; my motivation behind it is covered below.

## Architecture

My laptop triggers the run, two "writers" stream live 8-channel "audio" (just ffmpeg-generated audio), and the target EC2 instance does the ingest and flush work in 100 ms chunks.

![Benchmark architecture overview](https://irvyn.us/writing/gen-5-nvme/arch.png)

That layout is what let me compare instance types without hiding the write path behind extra layers. The target instance and the writer nodes are doing the real work; everything else is just orchestration and observability.

## 32 concurrent audio streams run

- Yellow = ? guess
- Green = ?

The p95 means added latency PER 100 ms chunk getting written to disk. In a real-world scenario the total delay is likely higher, since you would often upload 100 ms chunks over a WebSocket and then receive a text response from an external STT service (so it's quite crucial we keep the `flush_p95` low).

At the end of the day this is a synthetic benchmark; it is not terribly accurate and just looks to measure the "latency" around writing 100 ms chunks when concurrency is high. Single-instance monoliths can stream a _LOT_ of audio in reality...
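For context on what `flush_p95` measures, here is a minimal sketch of the per-chunk timing, assuming a made-up chunk size, sample rate, and output paths rather than the real harness:

```python
import concurrent.futures
import os
import statistics
import time

# Assumed shape: 8 channels * 16-bit samples * 100 ms at 16 kHz.
CHUNK_BYTES = 8 * 2 * 1600
CHUNKS_PER_STREAM = 300  # 30 s of "audio" per stream

def stream_writer(path: str) -> list[float]:
    """Write 100 ms chunks at realtime pace, timing each flush+fsync."""
    latencies = []
    payload = os.urandom(CHUNK_BYTES)
    with open(path, "wb") as f:
        for _ in range(CHUNKS_PER_STREAM):
            start = time.perf_counter()
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
            took = time.perf_counter() - start
            latencies.append(took)
            time.sleep(max(0.0, 0.1 - took))  # keep realtime cadence
    return latencies

paths = [f"./stream-{i}.bin" for i in range(32)]  # point these at the NVMe mount
with concurrent.futures.ThreadPoolExecutor(max_workers=32) as pool:
    all_latencies = [lat for run in pool.map(stream_writer, paths) for lat in run]

# quantiles(n=20) yields 19 cut points; index 18 is the 95th percentile.
print(f"flush_p95: {statistics.quantiles(all_latencies, n=20)[18] * 1000:.2f} ms")
```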

![32-channel benchmark dashboard](https://irvyn.us/writing/gen-5-nvme/wav-8chan-32x.png)

## 16 concurrent audio streams run

![16-channel benchmark dashboard](https://irvyn.us/writing/gen-5-nvme/wav-8chan-16x.png)

The `i7i` and `t3` are relatively close on the lower-concurrency runs (still a decent delta).

## Local NVMe sanity check

My motivation for this (this one is from the recently released Apple M5 disks: the extremely expensive 8TB option has more chips on its NVMe, so it is the only variant capable of these speeds, and it is also a few thousand £ extra...)

![MacBook Pro local NVMe result](https://irvyn.us/writing/gen-5-nvme/mac-m5-max-nvme-5.webp)



## AWS storage comparison

The post from a PlanetScale employee that motivated me to run this benchmark:

![R7i gp3 versus i7i NVMe throughput comparison](https://irvyn.us/writing/gen-5-nvme/r7i%20gp3%20vs%20i7i%20nvme.jpeg)


## Instance details

![i7i.large instance pricing and shape](https://irvyn.us/writing/gen-5-nvme/Screenshot%202026-04-04%20at%2011.18.40.png)

![i7i family storage limits and bandwidth table](https://irvyn.us/writing/gen-5-nvme/Screenshot%202026-04-04%20at%2012.19.02.png)


---

# F1 ML Notes

Date: 29 March 2026

Summary: F1-ml X auto-research.

Canonical HTML: https://irvyn.us/writing/f1-ml/

# F1
As a Mercedes F1 fan, I have found the start of 2026 refreshing, almost like the feeling of lockdown lifting.

We finally have decent cars, and it's convincing enough for me to wake up at 6am on a Sunday to make a coffee and watch the race.

However, I wanted something more to keep me obsessed with the race weekends and to keep me locked in, so I decided to try doing some ML that learns + predicts as the session + season go on. I also wanted to learn more about Cloudflare's developer platform, so it made sense to build the page + the stuff behind https://irvyn.us/f1.

To explain what it is from an architecture stand-point:

```text
      ----------------------------------------------------------------------------------
      crons                -> hourly + 5-min + 1-min triggers on race-weekend days
      workflow binding     -> a step-by-step large TS file for scraping the start times of f1 sessions
      durable object       -> helps ^ by effectively being a durable version of the data ^ produces
      buckets              -> F1_PRIVATE_BUCKET + F1_PUBLIC_BUCKET (S3 buckets - called R2 in CF world)
      containers           -> ReplayPublisherContainer + LiveSourceBridgeContainer
      ----------------------------------------------------------------------------------
                                           |
                                           v
                              +-----------------------------+
                              | Worker                      |
                              | reconcile schedule + start  |
                              +-------------+---------------+
                                            |
                         +------------------+------------------+
                         |                                     |
                         v                                     v
              +-----------------------------+       +-----------------------------+
              | R2                          |       | Durable Object              |
              | persist plans + "overrides" |       | keep active session state   |
              +-------------+---------------+       +-------------+---------------+
                            |                                     |
                            +------------------+------------------+
                                               |
                                               v
                              +-----------------------------+
                              | Workflow                    |
                              | wait, then start "runners"  |
                              +-------------+---------------+
                                            |
                                            v
                              +-----------------------------+
                              | Containers                  |
                              | run ML, write JSON, stop    |
                              +-------------+---------------+
                                            |
                                            v
                              +-----------------------------+
                              | R2                          |
                              | store published JSON and    |
                              |  let this site read it      |
                              +-----------------------------+
```

Sure, most of this can be done on one VPS with Go, but with Cloudflare you get the above primitives, like Containers + serverless Workers (which are fast and located in unique geographical locations, since CF has its own data centers), on a free / cheap $5/month tier, which can scale to let you do some cool stuff.

So the main idea of the above is:
1. We need to know when FP1-3/Q/Race starts
2. Store that somewhere it can be used
3. Use it to smartly start the compute for running live ML (it's not much: 1 vCPU + 6GB of RAM)
4. Post the ML results into R2 (the S3 equivalent)
5. Read the JSON live as it streams (essentially an append-only log for the race)

The page at /f1 is essentially static HTML + JS + CSS; while a session is live it polls `latest.json` and digests the live ML predictions easily.

Obviously the pattern above is quite trivial; a decent SWE, or anyone with an LLM coding partner, could set it up. But I wanted to do more experiments on the actual underlying ML, so naturally I did. I had been reading about this _thing_ called Pi and how it can be used for *autoresearch*.

# Base ML

I knew the starting point should be boring.

Not a giant model, not some overfit racing oracle, just a small tabular model that could deal with a noisy weekend state and publish stable JSON for the page. The core lane was basically sklearn-style structured prediction over canonical race-weekend rows, then a projection layer on top that turns those raw outputs into something the page can actually render.

That was enough to get something live, but it also made the weak spots obvious very quickly. The Japan weekend is what really exposed it: the model could see strong McLaren practice evidence and still project them too far down because priors and post-processing were overpowering the current weekend.

## What my original sklearn model was like

Classic tabular ML stack:

- inputs: one row per driver
- features: team, engine, track type, prior ratings, form, practice pace, long-run laps, qualifying rank, top speed, weather, etc
- vectorization: flatten that row into a sparse numeric feature vector
- models:
  - classifier for `win`
  - classifier for `podium`
  - classifier for `top10`
  - regressor for expected finish position
- calibration: isotonic-style smoothing so the probabilities were less silly
- output: JSON prediction rows for the site

So mentally it was something like:

```text
driver/weekend snapshot
        |
        v
[feature row]
team=McLaren
stage=post_practice
prior_team_rating=...
practice_trimmed_pace_gap_s=...
qualifying_rank=...
top_speed_delta_kph=...
weather=...
        |
        v
[DictVectorizer]
turn mixed fields into model matrix
        |
        v
[small sklearn models]
P(win), P(podium), P(top10), expected_finish
        |
        v
[calibration + projection logic]
smooth probabilities
rank field
build finish intervals
        |
        v
[event_outlook.json]
what /f1 actually renders
```
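A compact sketch of that lane with synthetic stand-in data (the real features, labels, and training set are different; this only shows the DictVectorizer + small model + calibration shape):

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
teams = ["McLaren", "Ferrari", "Red Bull", "Mercedes"]

# One dict per driver/weekend snapshot, with made-up stand-in values.
rows = [
    {
        "team": teams[int(rng.integers(len(teams)))],
        "stage": "post_practice",
        "qualifying_rank": int(rng.integers(1, 21)),
        "practice_trimmed_pace_gap_s": float(rng.normal(0.3, 0.2)),
        "top_speed_delta_kph": float(rng.normal(0.0, 4.0)),
    }
    for _ in range(400)
]
# Fake label: the front of the grid wins more often.
y_win = [int(r["qualifying_rank"] <= 2 and rng.random() < 0.5) for r in rows]

vec = DictVectorizer(sparse=True)  # mixed fields -> sparse model matrix
X = vec.fit_transform(rows)

# Small classifier plus isotonic calibration; the real lane trains one
# of these per target (win / podium / top10).
win_model = CalibratedClassifierCV(
    LogisticRegression(max_iter=1000), method="isotonic", cv=3
)
win_model.fit(X, y_win)

p_win = win_model.predict_proba(X[:1])[0, 1]
print(f"P(win) for first row: {p_win:.3f}")
```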

The important part is how naive it was in a good way. It was not trying to learn some magical hidden representation of Formula 1. It was mostly saying:

"Given the priors, the current session evidence, and a few race-context features, what is the probability this driver wins / podiums / finishes in the top 10, and what finish position does that imply?"

That is exactly why autoresearch was useful. Once the baseline was simple enough to understand, the failure modes were also simple enough to debug. The problem was not "the model needs to be bigger". The problem was "this specific pipeline is weighting the wrong things at the wrong time."

# Autoresearch - what it is

[`pi-autoresearch`](https://github.com/davebcn87/pi-autoresearch) is the loop I used to attack that problem. Zoomed out, it is not "an AI model". It is experiment infrastructure for an agent.

The useful mental model is:

- Pi is the terminal runtime and dashboard layer.
- `pi-autoresearch` is the extension/skill that turns that runtime into an edit -> benchmark -> log -> keep/discard loop.
- Codex is the coding agent doing the actual repo work inside that loop.
- `autoresearch.sh` is the judge.

That separation matters. I did not want to build a one-off API harness that generated patches and hoped for the best. I already had a Codex subscription, and what I actually needed was a repo-aware coding agent with shell access, git access, file editing, and enough context to keep iterating inside a real Python project. In this setup Pi handled the experiment loop and UI, while `codex exec --full-auto --json` acted as the worker that actually made changes and ran the repo.

In the F1 shadow workspaces I pinned `pi-autoresearch` to upstream commit `62feb2f46ef2a1b8e39af381b47acc4d7af42ca8`, seeded the run with an `autoresearch.md` file describing the objective and guardrails, and let the loop work from there. Each experiment had:

- a benchmark contract in `autoresearch.sh`
- correctness backpressure in `autoresearch.checks.sh`
- stop rules in `autoresearch.config.json`
- append-only run history in `autoresearch.jsonl`
- current state in `autoresearch.state.json`
- generated reports in `summary.json` and `summary.md`

That made the whole thing much more verifiable than "I prompted a model a lot and vibes-checked the output". A run either improved the benchmark under the same contract or it did not. Kept runs survived as commits. Discarded runs were reverted but still logged. The notes in `autoresearch.md` acted as the seed file and memory for the next iteration.
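As an example of what that buys you, a tiny summary pass over the run history might look like this; the field names in the JSONL are assumptions about the log shape, not a documented schema:

```python
import json

kept, discarded, best = 0, 0, None
with open("autoresearch.jsonl") as log:
    for line in log:
        run = json.loads(line)
        # Assumed fields: "kept" (bool) and "primary_score" (lower is better).
        if run.get("kept"):
            kept += 1
            score = run.get("primary_score")
            if score is not None and (best is None or score < best):
                best = score
        else:
            discarded += 1

print(f"runs kept: {kept}, discarded: {discarded}, best primary_score: {best}")
```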

The terminal view while it was running looked like this:

![Pi autoresearch running in the terminal](https://irvyn.us/writing/f1-ml/Screenshot%202026-03-29%20at%2019.57.58.png)

The small dashboard detail I liked most is that it makes the loop legible. You can see how many runs happened, how many were kept, the current primary score, the confidence multiple, and a bunch of second-order metrics that stop you from accidentally "improving" the benchmark by breaking the thing you actually care about.


# Autoresearch round 1 

Round 1 was the direct response to the Japan post-practice failure mode: McLaren looked too weak even when the weekend evidence said the opposite.

This round taught the main lesson of the whole exercise: the bug was not "the model is too simple". The bug was that priors and projection logic were crushing current-weekend evidence.

The changes that actually mattered were:

- shrink priors toward neutral when post-practice evidence is strong
- clip noisy practice pace and degradation signals
- add teammate-aware expected-finish blending
- add a tiny podium-only post-practice front-runner boost
- tune the projection blend weights instead of replacing the model family

The baseline proxy had `primary_score = 0.210087` and a very bad `japan_mclaren_projection_drop = +0.2526`. The stable Round 1 lane got to about `0.1987` and flipped that projection drop negative. That was the actual win: stop the projection layer from making an already plausible McLaren read worse.

What I cared about enough to keep:

- evidence-weighted post-practice features
- better handling of noisy practice inputs
- teammate-aware projection logic
- very small, explicit post-processing repairs instead of giant model churn

This was the round that produced the most obviously production-worthy behavior changes. It was not glamorous, but it made the live model less dumb.

What I explicitly did **not** want to keep:

- raw same-stage team-context features that looked amazing on the benchmark but re-broke the Japan sanity case

That last point matters a lot. One of the discarded lanes got the raw score down to `0.19046`, but it was still discarded because it made the real-world behavior worse again. That was the point of running the loop with hard side metrics instead of a single scalar and calling it done.

# Autoresearch round 2

Round 2 started from the kept Round 1 checkpoint and asked a better question: if Round 1 fixed the obvious symptom, what is the more structural version of that fix?

The two ideas that mattered were:

- a hard post-practice projection guardrail
- a pace-vs-reliability split in the priors

The guardrail idea is very software-engineering-coded: if the raw model already says a front-runner looks strong and the practice evidence is real, the post-processing layer should only be allowed to demote that driver by a tiny amount. If the projection layer is about to do something obviously dumb, put a boundary around it.
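In code, the guardrail is something like this hypothetical sketch; the real thresholds and field names in the `f1-ml` container differ:

```python
def guard_post_practice_projection(
    raw_rank: float,
    projected_rank: float,
    practice_evidence: float,
    front_runner_cutoff: int = 3,
    evidence_floor: float = 0.7,
    max_demotion: float = 1.5,
) -> float:
    """Bound how far post-processing can demote a strong front-runner.

    If the raw model already ranks the driver near the front AND the
    practice evidence is real, the projection layer may only push them
    down by a small bounded amount; otherwise it is left alone.
    """
    if raw_rank <= front_runner_cutoff and practice_evidence >= evidence_floor:
        return min(projected_rank, raw_rank + max_demotion)
    return projected_rank
```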

The pace-vs-reliability split was the more ML-shaped idea. Earlier DNFs and reliability chaos were bleeding too directly into pace-sensitive priors. Round 2 separated "this car was fast" from "this weekend ended badly", which is much closer to how you would actually reason about a Formula 1 team.
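A sketch of that split, with hypothetical fields and weights; the point is only that a DNF updates reliability while pace follows lap-time evidence:

```python
def update_team_prior(prior: dict, weekend: dict, alpha: float = 0.3) -> dict:
    """Blend weekend evidence into separate pace and reliability priors."""
    updated = dict(prior)
    # Pace only moves when there is actual lap-time evidence.
    if weekend.get("pace_gap_s") is not None:
        pace_signal = -weekend["pace_gap_s"]  # smaller gap -> stronger pace
        updated["pace"] = (1 - alpha) * prior["pace"] + alpha * pace_signal
    # Reliability moves on how the weekend ended, independent of pace.
    finished = 0.0 if weekend.get("dnf") else 1.0
    updated["reliability"] = (1 - alpha) * prior["reliability"] + alpha * finished
    return updated
```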

Round 2 improved the guarded benchmark again to `0.194170`, but it also produced a very useful warning: the shadow artifact itself was not automatically production-ready just because the benchmark improved. The Japan panel became too Piastri-heavy and still left Norris too low, so the right thing to keep was the code-level insight, not the full shadow output.

So for production I cared much more about the guardrail logic and the shape of the prior fix than I cared about promoting the exact Round 2 shadow bundle.

So the production-facing summary from both rounds is basically this:

- ship guardrails and behavior fixes, not every benchmark winner
- let current-weekend evidence overpower stale prior noise when the evidence is strong
- keep reliability separate from raw pace as much as possible
- treat post-processing as part of the model, because that is where a lot of the real bugs actually are

That is also why I liked the `pi-autoresearch` loop for this. It was not just searching for lower numbers. It was forcing me to learn and understand what "better" actually meant, record what failed, and keep the bits that were portable back into the real `f1-ml` container instead of blindly shipping the best-looking shadow run. I would consider future ML time wasted if loops like this weren't explored for automatic iterative improvement.


---

# How fast is your CDN

Date: 7 March 2026

Summary: Static assets are all you need; build less.

Canonical HTML: https://irvyn.us/writing/why-this-site-is-small/

In the days where it is so easy to build *things*, it's nice to build less. Sometimes software comes along that makes you realise how slow and bloated the internet is becoming. Some notable examples are Zed, Linear, PlanetScale, Turbopuffer, etc (not just their websites). They all do different things, but it's clear they care about the product and its performance, unlike many other examples.

I think this pattern will become more common: engineers can now build by just asking a few questions and sending off some prompts, shipping systems they have no business building and that often don't match the problem or use case.

The skill in software engineering now isn't just deep technical knowledge; it's also understanding what to add to the product to deliver value and what _NOT_ to build.

## Things that I think about

These are sidecar references rather than core page payload. They load after the page renders, but they mirror the kind of throughput, latency, and region-to-region tradeoffs we should all think about. They are fetched through a same-origin Worker endpoint and then cached aggressively so the page stays cheap to serve.

The HTML page includes live sidecar cards for napkin-math and CloudPing data. Agents should use the static source text here and the linked endpoints only if live values are needed.



On `/edge`, you can inspect what Cloudflare sees about the current request: colo, protocol, TLS, ASN, ISP, and whatever else is available on that connection. I find it pretty cool that this is exposed so easily; it also helps if you are curious about your ISP or routing.


---

# Edge Inspector

Human page: https://irvyn.us/edge/

The edge inspector shows Cloudflare request metadata for the current connection, including colo, city, region, country, protocol, TLS version, ASN, organization, and request timestamp when Cloudflare provides those fields.

Raw JSON endpoint: https://irvyn.us/edge.json
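A minimal way to read it programmatically (Python; the field names follow Cloudflare's `request.cf` conventions, but the exact payload of this endpoint may differ):

```python
import json
import urllib.request

# Fetch the same metadata the /edge page renders.
with urllib.request.urlopen("https://irvyn.us/edge.json") as resp:
    edge = json.load(resp)

# Fields such as colo or TLS version may or may not be present
# depending on what Cloudflare provides for the connection.
for key in ("colo", "city", "region", "country", "asn", "httpProtocol", "tlsVersion"):
    print(f"{key}: {edge.get(key)}")
```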
