llm.c: multi-GPU, bfloat16, flash attention, ~7% faster than PyTorch (twitter.com)
pama 18 hours ago [-]
It's much faster than stable PyTorch 2.3 as well (46% on an A100, per the tweet), and faster still than PyTorch 2.2, which was the stable version a couple of weeks ago. The gap also widens when the comparison is run on an H100 instead of an A100, or on multiple GPUs instead of a single one.
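For reference, the bfloat16 and flash attention pieces from the title map onto real PyTorch 2.x APIs. Here is a minimal sketch; the shapes and the helper name attention_forward are illustrative, not taken from llm.c or the tweet's benchmark:

    import torch
    import torch.nn.functional as F

    def attention_forward(q, k, v):
        # PyTorch can dispatch this to a FlashAttention kernel when the
        # hardware and dtype allow it (e.g. bf16 on A100/H100).
        return F.scaled_dot_product_attention(q, k, v, is_causal=True)

    # Illustrative shapes: batch, heads, sequence length, head dim.
    B, H, T, D = 8, 12, 1024, 64
    q = torch.randn(B, H, T, D, device="cuda")
    k = torch.randn(B, H, T, D, device="cuda")
    v = torch.randn(B, H, T, D, device="cuda")

    # bfloat16 autocast, as in the mixed-precision setup the title describes.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        out = attention_forward(q, k, v)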
gpapilion 23 hours ago [-]
I’d be happier with 93% of PyTorch's speed if it worked across multiple GPU manufacturers.
tyfighter 2 hours ago [-]
Yeah, I'm sure that's what anyone trying to build some kind of AI startup that's managed to acquire a small handful of A100s, or even better H100s, thinks too. "Those cards sure were expensive, but ethically, I'd rather the software run slower to give me future imaginary options than get the most out of the hardware I just bought."
reallymental 19 hours ago [-]
That... wasn't the original intention of the project. It was to create a C version of the PyTorch code that could train GPT-2.
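In that spirit, the PyTorch reference that llm.c ports boils down to an ordinary training loop. A minimal sketch, assuming a tiny stand-in module rather than the project's actual GPT-2 (which is a 124M-parameter transformer with a 50257-token vocabulary):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Hypothetical stand-in model so the loop runs end to end.
    vocab_size, n_embd, seq_len, batch_size = 50257, 768, 64, 4
    model = nn.Sequential(nn.Embedding(vocab_size, n_embd),
                          nn.Linear(n_embd, vocab_size))
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

    for step in range(10):  # illustrative step count
        # Random tokens as placeholder data; the real script trains on text.
        x = torch.randint(0, vocab_size, (batch_size, seq_len))
        y = torch.randint(0, vocab_size, (batch_size, seq_len))
        logits = model(x)                                  # (batch, seq, vocab)
        loss = F.cross_entropy(logits.view(-1, vocab_size), y.view(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()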
michaelgiba 6 hours ago [-]
It's pretty impressive that PyTorch is only 7% slower than this, given it can be used so generally.
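Part of what makes that gap notable is that PyTorch's compiler has to accept arbitrary models through the same one-line call, while llm.c is hand-written for a single architecture. A sketch, with the model choice purely illustrative rather than anything from the benchmark:

    import torch
    import torch.nn as nn

    # Any module can go through the same path; this encoder stack is just
    # an illustrative choice.
    model = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
        num_layers=12,
    )
    compiled = torch.compile(model)  # fuses/optimizes kernels for whatever is passed in

    x = torch.randn(4, 128, 768)  # batch, sequence, embedding dim
    out = compiled(x)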
ein0p 19 hours ago [-]
Created over a period of like 4 weeks by random people all over the internet