Fine-tune Llama 3 on a million-scale dataset on a consumer GPU using QLoRA and DeepSpeed (medium.com)
unraveller 9 days ago [-]
This is a thorough "how to" but it is missing a "why for" about any of the chosen starting elements.

I don't understand why you would use an old dataset that worked for Llama 2 and just fine-tune Llama 3 on it. Isn't it most likely that the new model has already covered everything it missed last time around, so the old dataset is only valuable for the last generation?

factorymoo 9 days ago [-]
This might be an unfair statement, but it really feels like all of these blogs don't know why. They copy/paste each other (you often see the same errors in multiple notebooks/blogs) and I have a feeling no one really deeply understands what they're doing.
unraveller 9 days ago [-]
Found my answer for the "why", thanks to the issues in the latest dolphin fine-tune. They do these types of fine-tunes mainly to reduce refusal rates and increase intelligence. They did the knee-jerk rerun of the same old data this time, as I suspected, just for lols to see where open source is at.

Spoiler alert: fine-tunes won't get better until their data quality is better than Meta's instruction fine-tune. Give it a few weeks.

Why does [dolphin-l3-8B] perform substantially worse in some tests?

Essentially, it's trained like this:

  LLama-3-8B-base_model --> LLama-3-8B-Instruct
  LLama-3-8B-base_model --> dolphin-2.9-llama3-8B
And not like this:

  LLama-3-8B-Instruct --> dolphin-2.9-llama3-8B
https://huggingface.co/cognitivecomputations/dolphin-2.9-lla...
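
(For concreteness, here is a minimal sketch of what that lineage difference means in code, using the standard transformers/PEFT APIs. The model IDs are the public Meta checkpoints; the LoRA config is a placeholder, and whether dolphin itself used adapters or full fine-tuning is a separate question.)

  # Minimal sketch: a fine-tune branches off whichever checkpoint you load here.
  # dolphin-2.9 branches off the base model; the "not like this" case would start
  # from the Instruct model instead.
  from transformers import AutoModelForCausalLM, AutoTokenizer
  from peft import LoraConfig, get_peft_model

  base_id = "meta-llama/Meta-Llama-3-8B"            # base lineage (what dolphin uses)
  # base_id = "meta-llama/Meta-Llama-3-8B-Instruct" # instruct lineage (the alternative)

  model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto")
  tokenizer = AutoTokenizer.from_pretrained(base_id)

  lora = LoraConfig(
      r=16,                                          # placeholder rank
      lora_alpha=32,
      target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
      task_type="CAUSAL_LM",
  )
  model = get_peft_model(model, lora)
  model.print_trainable_parameters()                 # only adapter weights train; base stays frozen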
jackblemming 9 days ago [-]
Most of the entire field of machine learning is “try shit and see what works”. So it seems like they’re par for the course.
v3ss0n 9 days ago [-]
Same as in the software engineering field.
littlestymaar 9 days ago [-]
It's even worse for AI given that nobody really understands why anything works.
sinuhe69 9 days ago [-]
I wonder what we don’t understand from the SE POV?
ijk 9 days ago [-]
One additional problem with people who write breathless tutorials about doing things with AI is that those tutorials are more likely than average to have been written with ChatGPT. Which, given the knowledge cutoff for most models, is not where I'd personally turn for data on recent technical developments, but it is par for the course for the kind of low-effort copy-paste bloggers doing it for attention.

This particular one seems to be from someone who is documenting their learning process, which is a valuable contribution but, obviously, not a source of great authority on the hows and whys.

sa-code 9 days ago [-]
Thank you for saying this! The number of people that would actually need to fine-tune, vs just using RAG, is really small. People who are not familiar with the source often jump to fine-tuning as an option.
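
(As a rough illustration of what "just using RAG" can amount to, a toy sketch; the embedding model, documents and question below are placeholders picked for the example, not anything from this thread.)

  # Toy RAG sketch: embed documents, retrieve the closest one for a query,
  # and put it in the prompt; no fine-tuning involved.
  from sentence_transformers import SentenceTransformer, util

  docs = [
      "Our refund policy allows returns within 30 days.",
      "Support is available Monday to Friday, 9am-5pm CET.",
  ]  # placeholder documents standing in for proprietary data

  embedder = SentenceTransformer("all-MiniLM-L6-v2")
  doc_emb = embedder.encode(docs, convert_to_tensor=True)

  query = "When can customers return a product?"
  query_emb = embedder.encode(query, convert_to_tensor=True)

  best = util.semantic_search(query_emb, doc_emb, top_k=1)[0][0]
  context = docs[best["corpus_id"]]

  prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
  # `prompt` then goes to whatever LLM you already use, with its weights untouched.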
Foobar8568 9 days ago [-]
I am still unsure where to stand on fine-tuning vs RAG. I feel that for live data RAG would be preferable, but for data that is updated daily/weekly, fine-tuning.

Another aspect where I am unsure is multi-user use of a model, e.g. can we have concurrent queries against one model, or do the queries have to be queued?

bigfudge 9 days ago [-]
Fine tuning doesn’t ’add content’ the way RAG does though. They’re not really comparable in that way.
Foobar8568 9 days ago [-]
So it's more for being optimized for specific tasks within a domain?
blackoil 9 days ago [-]
The dataset may not be public. All large companies have millions of internal documents, and an internal LLM can be trained on them.
bradfox2 9 days ago [-]
QLoRA won't work well for adding knowledge from private data.

Parameter-efficient methods are not useful for these cases at the 8B scale without a more complex training procedure that periodically merges the adapters back into the base weights. Maybe at the 70B scale.
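
(To make "periodically merges the adapters back" concrete, here is a rough PEFT-based sketch of that idea. This is an illustration, not the parent's actual procedure; the model ID, LoRA config and number of cycles are arbitrary, and the training pass itself is omitted.)

  # Rough sketch: train a LoRA adapter, merge it into the dense weights,
  # then start a fresh adapter, so new knowledge accumulates in the full model.
  from transformers import AutoModelForCausalLM
  from peft import LoraConfig, get_peft_model

  model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B", torch_dtype="auto")
  lora_cfg = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")

  for cycle in range(4):                       # number of merge cycles is arbitrary here
      peft_model = get_peft_model(model, lora_cfg)
      # ... run a normal training pass on peft_model here (Trainer/SFTTrainer), omitted ...
      model = peft_model.merge_and_unload()    # fold the adapter back into the dense weights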

tpurves 9 days ago [-]
What scale of company do you need to be to actually be able to afford, and get a return on investment from, retraining base models with your own proprietary knowledge and docs? Also considering the implications of continually retraining?
sdesol 9 days ago [-]
I was under the impression that you wouldn't. If you want access to proprietary knowledge, you would use RAG + LLM.
bradfox2 9 days ago [-]
The only experience I have is first-hand: what my company is doing for our client base. We are doing continuous pretraining, plus the rest of the alignment-stack training, on about 10B private tokens plus private customer data, to produce private custom models for companies in the 500 to 3000 employee range. We built and operate a single-rack cluster that cost mid six figures in order to be able to do this.

These models get combined with RAG for highly specific technical doc authoring and other uses.

tpurves 9 days ago [-]
This is very helpful context on what works right now, thanks for sharing.
littlestymaar 9 days ago [-]
I don't think anyone has the answer to this question yet.
anonymousDan 9 days ago [-]
Can you point to any literature on this by any chance? I would be really interested to see some in depth analysis.
imjonse 9 days ago [-]
The "why for" is usually learning/gaining experience/FOMO.
TOMDM 9 days ago [-]
For the human or the LLM?
iAkashPaul 9 days ago [-]
With Unsloth's optimizations you can do Llama-3-8B QLoRA fine-tuning on your 8GB card (mine's a 2070S) with 900MB to spare at a batch size of 4.
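
(Roughly what that looks like, sketched from Unsloth's published example notebooks; the dataset, formatting function and hyperparameters below are placeholders, not the parent's exact setup.)

  # Sketch of an Unsloth QLoRA run (illustrative settings, not a tuned recipe).
  from unsloth import FastLanguageModel
  from trl import SFTTrainer
  from transformers import TrainingArguments
  from datasets import load_dataset

  model, tokenizer = FastLanguageModel.from_pretrained(
      model_name="unsloth/llama-3-8b-bnb-4bit",  # pre-quantized 4-bit weights
      max_seq_length=2048,
      load_in_4bit=True,
  )
  model = FastLanguageModel.get_peft_model(
      model,
      r=16,
      lora_alpha=16,
      target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
  )

  def to_text(example):
      # placeholder formatting: collapse alpaca-style fields into one training string
      return {"text": example["instruction"] + "\n" + example["input"] + "\n" + example["output"]}

  dataset = load_dataset("yahma/alpaca-cleaned", split="train").map(to_text)

  trainer = SFTTrainer(
      model=model,
      tokenizer=tokenizer,
      train_dataset=dataset,
      dataset_text_field="text",
      max_seq_length=2048,
      args=TrainingArguments(
          per_device_train_batch_size=4,         # the batch size mentioned above
          gradient_accumulation_steps=4,
          max_steps=60,
          learning_rate=2e-4,
          fp16=True,                             # Turing cards like the 2070S lack bf16
          output_dir="outputs",
      ),
  )
  trainer.train()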
SunlitCat 9 days ago [-]
Since the crypto (currency) craze of 2017, every time I hear "consumer GPU" somewhere in a story that has nothing to do with gaming, it sends a chill down my spine.
j0hnyl 7 days ago [-]
RIP your spine for the foreseeable future.