Goyalayus

Notes, essays, and fragments from the edge of understanding.

learnings from my first RL training run or maybe runs

June 14, 2026

Original Notion page

feb 2026

my experience training a 0.6b qwen model to play wordle and observing it learn to cheat and collapse

a month back, i decided to finally stop studying theory and get my hands dirty. so I made a cool looking project plan (optimized for twitter) and took it to tokenbender, and he gave me some profound advice

“when you are starting out, start with the basics. so that you get a hang of the controls, start by doing something like wordle”

I was not aware of Wordle, so first I searched that up and then I searched how people have gone about training models to play wordle. the prime-intellect repo came up, I saw they used a 1.7B Qwen and the method they followed was SFT first, then RL

I was totally unaware of training frameworks out there, I knew vaguely that there is hugging face, native pytorch, unsloth, prime-intellect etc etc

I read up on all of these and what i understood was:

  • nobody really trains in plain native pytorch, like you really don’t wanna do it

  • prime-intellect is mostly for RL and you can’t run it outside their environment which is not free

  • huggingface has a library called TRL which is mostly used

  • unsloth is basically a library which writes custom kernels to make LoRA fine-tuning faster and cheaper, so I decided to go with unsloth and just to start small I chose Qwen-0.6B

I knew we do not want to do full fine-tuning, it takes a lot of GPU. so I read the original lora and qlora papers, not many learnings from the papers honestly. thinky machines have a paper called “lora without regret”, it’s really good. the main takeaway was to apply lora only to mlp layers and not to the attention layers

another interesting thing was that i used to think lora and qlora are only for sft, I never thought about applying them to RL, but now it’s pretty obvious

SFT

wilccbb had a Wordle dataset which they made by running inference on their 1.7B Wordle model, the one they had RL’ed. I converted it into a format which would support sft and pushed it to huggingface

I started calculating the total vram that would be required to sft the model, first I tried doing it with GPT but it constantly miscalculated the activations so I had to do it myself

which was stupid in retrospect: just ask an llm how much memory it would take in total, they give a good rough estimate. or you can also use this website by NVIDIA
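for a rough feel of where the memory goes with lora, here is a back-of-the-envelope sketch. every number in it is an assumption for illustration (bf16 frozen weights, fp32 adapter weights with Adam, a flat guess for activations), not a measurement from this run, and `estimate_lora_vram_gb` is a made-up helper name.

```python
def estimate_lora_vram_gb(
    base_params: float,          # frozen base model parameters
    lora_params: float,          # trainable adapter parameters
    base_bytes: int = 2,         # assumption: frozen weights kept in bf16
    activation_gb: float = 2.0,  # assumption: flat placeholder, really depends on batch size and seq length
) -> float:
    """Very rough VRAM estimate in GB for LoRA fine-tuning."""
    frozen = base_params * base_bytes
    # each trainable param: fp32 weight + fp32 grad + two fp32 Adam moments = 16 bytes
    trainable = lora_params * 16
    return (frozen + trainable) / 1e9 + activation_gb


# hypothetical numbers: a 0.6B base model with 10M adapter parameters
print(round(estimate_lora_vram_gb(0.6e9, 10e6), 2))
```

the activation term is the part that is hard to get right by hand, which is why it is left here as a single knob instead of a formula.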

one cool thing I learnt is that how sequence length is handled in SFT is fundamentally different than how it is done in pre-training

details (this is AI generated but pretty solid explanation tbh)

Pre-training vs SFT

| Aspect | SFT | Pre-training |
| --- | --- | --- |
| Input | Chat-style messages (system, user, assistant) | Raw text chunks |
| Length | Varies per example | Often fixed (e.g. 2048) |
| Loss | Often only on assistant tokens | On all tokens (or causal mask) |
| Packing | Optional (packing or padding) | Common (fill context) |

What **seq_len / max_seq_length** Means in SFT

It’s the maximum total tokens per example (all messages concatenated):

[system] [user] [assistant]  →  tokenize  →  total tokens ≤ max_seq_length

If the tokenized example is longer than max_seq_length, it gets truncated.


How Length Is Handled

1. Truncation

When prompt + completion > max_seq_length:

  • Truncate from the left: drop older context, keep the latest assistant reply.

  • Truncate from the right: drop the end of the completion (loses training signal).

  • Truncate both: e.g. trim prompt from the left and cap completion length.

A common choice for chat data: truncate the prompt from the left so the model always trains on full assistant responses. Check your framework's default though; Hugging Face tokenizers, for example, default to truncation_side="right".

2. Padding

  • Shorter examples are padded (often with a pad token) to a common length.

  • Padding tokens are masked out of the loss so they don’t affect gradients.

  • Padding uses extra compute (you forward through tokens you ignore).
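truncation, padding and loss masking fit together in a few lines. this is a minimal sketch on plain lists of token ids, assuming left truncation of the prompt and a pad id of 0; `build_example` is a made-up helper, not a TRL or Unsloth function. the only real convention used is -100, the label value PyTorch's cross-entropy ignores by default.

```python
IGNORE_INDEX = -100  # PyTorch cross-entropy skips this label value by default


def build_example(prompt_ids, completion_ids, max_len, pad_id=0):
    """Return (input_ids, labels), both exactly max_len long."""
    # left-truncate the prompt so the full completion survives
    room = max(max_len - len(completion_ids), 0)
    prompt_ids = prompt_ids[-room:] if room else []
    input_ids = (prompt_ids + completion_ids)[:max_len]
    # loss only on completion (assistant) tokens
    labels = ([IGNORE_INDEX] * len(prompt_ids) + completion_ids)[:max_len]
    # right-pad to max_len and mask the padding out of the loss
    pad = max_len - len(input_ids)
    return input_ids + [pad_id] * pad, labels + [IGNORE_INDEX] * pad


# prompt is too long by one token, so its oldest token gets dropped
ids, labels = build_example([1, 2, 3, 4], [9, 8], max_len=5)
print(ids, labels)
```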

3. Packing (sequence packing)

  • Several short examples are concatenated into one sequence up to max_seq_length.

  • Attention is masked so tokens from different examples don’t attend to each other.

  • No padding; better GPU utilization.

  • Used by TRL, Unsloth, etc.
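to see why packing helps, here is a toy first-fit packer that only looks at example lengths. it is a sketch of the idea, not how TRL or Unsloth implement it, and the lengths are made up. a real packer also has to build the attention mask (or reset position ids) so packed examples stay independent.

```python
def pack_greedy(lengths, max_len):
    """First-fit-decreasing packing: put each example into the first bin with room."""
    bins = []  # each bin is a list of example lengths sharing one sequence
    for n in sorted(lengths, reverse=True):
        if n > max_len:
            raise ValueError("truncate examples before packing")
        for b in bins:
            if sum(b) + n <= max_len:
                b.append(n)
                break
        else:
            bins.append([n])  # no bin had room, open a new one
    return bins


lengths = [900, 300, 700, 100, 50]  # hypothetical token counts
max_len = 1024
padded_tokens = len(lengths) * max_len      # padding: one full row per example
packed_bins = pack_greedy(lengths, max_len)
packed_tokens = len(packed_bins) * max_len  # packing: one full row per bin
print(packed_bins, padded_tokens, packed_tokens)
```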

doing the training was more or less straightforward, I used unsloth.

this is the loss curve I got

image.png

I was confused about when to stop training, Krishna helped me out here. he told me about validation loss. I knew about validation loss in general but I hadn't thought about how it could be used in LLM training, but it makes complete sense right

so train as long as your validation loss goes down
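that rule can be written as a tiny early-stopping check. this is a sketch, assuming you evaluate every so often and keep the validation losses in a list; `should_stop`, `patience` and `min_delta` are illustrative names and values.

```python
def should_stop(val_losses, patience=3, min_delta=0.0):
    """True once validation loss has not improved for `patience` evals in a row."""
    if len(val_losses) <= patience:
        return False  # not enough history yet
    best_before = min(val_losses[:-patience])
    recent_best = min(val_losses[-patience:])
    # stop if nothing in the recent window beat the earlier best by min_delta
    return recent_best >= best_before - min_delta


print(should_stop([1.0, 0.8, 0.7, 0.71, 0.72, 0.73]))  # loss stopped improving
print(should_stop([1.0, 0.8, 0.7, 0.6]))               # still going down
```

a bit of patience matters because validation loss is noisy, a single bad eval is not a reason to stop.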

one thing that wasted a week for me was that the model did not learn to output and tokens, so I tried a lot of things like bigger model, different hyper parameters.

and found out a week later that it was just a parsing issue in my logging. (don’t judge me, DON’T JUDGE ME)

RL

I was very tired of google colab from the sft time, chinmay recommended using lightning or modal and honestly I loved them. especially when you can ask cursor to ssh for you and run things there.

one mistake I made was to write code directly in the lightning code editor and run it from there. the better way is to write locally, push to github, pull inside the training script itself, run directly on lightning and push the trained model to HF (github has limits)

another thing I did not quite know was how addictive RL training runs are.

one thing I spent two days on was how to do multi-turn RL, unsloth did not support it. so I was searching for frameworks like verl, RL2 etc but then I realized i can just make up the wordle history, so I do not need to do multi-turn RL
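the idea of faking a wordle history so each training example is a single turn can be sketched like this. the prompt layout, the helper names and the tiny word list are all made up for illustration; only the G/Y/B feedback letters follow the logs shown later in this post.

```python
import random
from collections import Counter


def feedback(guess, secret):
    """Standard Wordle feedback: G = right spot, Y = wrong spot, B = absent."""
    result = ["B"] * 5
    remaining = Counter()  # secret letters not already matched as green
    for i, (g, s) in enumerate(zip(guess, secret)):
        if g == s:
            result[i] = "G"
        else:
            remaining[s] += 1
    for i, g in enumerate(guess):
        if result[i] != "G" and remaining[g] > 0:
            result[i] = "Y"
            remaining[g] -= 1
    return " ".join(result)


def make_prompt(secret, word_list, n_turns, rng):
    """Build a fake game history of n_turns random wrong guesses."""
    candidates = [w for w in word_list if w != secret]
    guesses = rng.sample(candidates, n_turns)
    lines = [
        f"Turn {i}: [{g}] -> {feedback(g, secret)}"
        for i, g in enumerate(guesses, start=1)
    ]
    return "\n".join(lines)


words = ["crane", "alert", "amaze", "ideal"]  # hypothetical word list
print(make_prompt("ideal", words, n_turns=2, rng=random.Random(0)))
```

one caveat with this approach: random prior guesses are not what the model itself would have played, so the histories it trains on can look different from the ones it produces at eval time.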

Technicalities

so I started with the same sft’d qwen .6B

my first reward function was pretty simple: give a format reward of .5 and a reward of 1 if the model got the answer right
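a minimal sketch of a reward like that, assuming the final guess is written in square brackets at the end of the completion (like `[crane]`) and that the two rewards simply add up. the regex and the function name are assumptions, not the exact code from the run.

```python
import re

# assumed format: the completion ends with a bracketed 5-letter guess
GUESS_RE = re.compile(r"\[([a-z]{5})\]\s*$")


def sparse_reward(completion, secret):
    """0.5 for a well-formed final guess, plus 1.0 if it is the secret word."""
    match = GUESS_RE.search(completion.strip().lower())
    if match is None:
        return 0.0  # no parseable guess, no reward at all
    reward = 0.5
    if match.group(1) == secret:
        reward += 1.0
    return reward


print(sparse_reward("a and e are yellow, so... [ideal]", "ideal"))
print(sparse_reward("let me try [crane]", "ideal"))
print(sparse_reward("I am not sure", "ideal"))
```

with a reward like this almost every rollout scores either 0 or 0.5, so there is very little to tell a good guess from a bad one.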

image.png

it didn’t work out. basically the rewards were too sparse for the model to learn; it tried to learn for some time but then collapsed and never came back up

so I started giving it some additional rewards like

  1. accuracy rewards —> basically reward the model for outputting green and yellow letters in the guess it makes

  2. consistency rewards —> give rewards to the model for respecting the feedback from history
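the two shaping rewards above could look roughly like this. it is a sketch under assumptions: green is worth 1.0 and yellow 0.5, consistency is the fraction of constraints from the history that the new guess respects, and all the names are made up.

```python
from collections import Counter


def feedback(guess, secret):
    """Standard Wordle feedback as a list of 'G', 'Y', 'B'."""
    result = ["B"] * 5
    remaining = Counter()
    for i, (g, s) in enumerate(zip(guess, secret)):
        if g == s:
            result[i] = "G"
        else:
            remaining[s] += 1
    for i, g in enumerate(guess):
        if result[i] != "G" and remaining[g] > 0:
            result[i] = "Y"
            remaining[g] -= 1
    return result


def accuracy_reward(guess, secret):
    """Partial credit for green and yellow letters (weights are assumptions)."""
    score = {"G": 1.0, "Y": 0.5, "B": 0.0}
    return sum(score[f] for f in feedback(guess, secret)) / 5


def consistency_reward(guess, history):
    """history: list of (past_guess, feedback) pairs. Fraction of constraints respected."""
    checks = []
    for past, fb in history:
        present = {p for p, f in zip(past, fb) if f != "B"}
        for i, (p, f) in enumerate(zip(past, fb)):
            if f == "G":
                checks.append(guess[i] == p)              # keep greens in place
            elif f == "Y":
                checks.append(p in guess and guess[i] != p)  # reuse yellows elsewhere
            elif p not in present:
                checks.append(p not in guess)             # avoid absent letters
    return sum(checks) / len(checks) if checks else 0.0


history = [("crane", feedback("crane", "ideal"))]
print(accuracy_reward("crane", "ideal"), consistency_reward("ideal", history))
```

note that neither function checks that the guess is a real word, which is exactly the kind of gap a policy can exploit.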

ways in which model tried to reward hack

any environment that can be reward hacked, will be reward hacked

  1. by overfitting and only outputting a single word which it learnt during fine-tuning. because it was a mix of vowels and consonants it would get some consistency rewards and some accuracy rewards, and the model got stuck there in a local optimum

  2. when i prevented this from happening by adding some checks, the model learnt to pick words from the game history we gave it and output those to get some consistency rewards

  3. when I fixed that, the model started to output things like AABBC, EEDEB. these would get some consistency rewards and some accuracy rewards
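the post does not spell out the exact checks, so the following is only a guess at what guards against these three hacks could look like. the penalty values, the 50% repetition threshold and every name here are hypothetical.

```python
def hack_penalty(guess, history_guesses, valid_words, recent_guesses, max_repeat_frac=0.5):
    """Return a negative penalty if the guess looks like one of the three hacks, else 0.0."""
    if guess not in valid_words:
        return -1.0  # hack 3: made-up strings like "aabbc"
    if guess in history_guesses:
        return -1.0  # hack 2: copying a word from the prompt history
    if recent_guesses:
        # hack 1: the same word showing up in most recent rollouts
        frac = recent_guesses.count(guess) / len(recent_guesses)
        if frac > max_repeat_frac:
            return -0.5
    return 0.0


valid = {"crane", "ideal", "alert"}  # hypothetical dictionary
print(hack_penalty("aabbc", [], valid, []))
print(hack_penalty("crane", ["crane"], valid, []))
print(hack_penalty("alert", ["crane"], valid, ["alert", "alert", "alert", "ideal"]))
print(hack_penalty("ideal", ["crane"], valid, ["alert", "ideal", "crane"]))
```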

after fixing all of this, this was my final curve

image.png

you can see a spike around step 300 right? that was when the model tried to reward hack, but because I had already identified what it was doing, the reward functions punished it, and you can see that in the graph right

how were the evals?

pretty shit, the model basically overfit to some words. later I got to thinking about why this is the case, like why is the model reward hacking and not able to learn

the answer I found out was :

first of all, how does grpo work? it does N rollouts for the same prompt in a single batch. suppose M out of the N perform well (got the reward): the policy is pushed to do more of what those M did, simple. now the assumption is that the model is using logic to get these, so pushing the policy in that direction will make the model "more logical". but what if the model is not reasoning at all, and is just acting like it reasons and guessing at the end (which you can see in the thinking logs)? then you would push the model to do more guessing
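the "push towards the rollouts that did well" part boils down to a group-relative advantage: each rollout's reward minus the group mean, divided by the group's standard deviation. a minimal sketch, assuming population std and a small epsilon (implementations differ on these details):

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Advantage of each rollout relative to the other rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# one rollout got the word right, the rest only got the format reward
print(group_advantages([1.5, 0.5, 0.5, 0.5]))
# everyone got the same reward: all advantages are zero, so no learning signal
print(group_advantages([0.5, 0.5, 0.5, 0.5]))
```

two things fall out of this. when rewards are sparse, most groups look like the second case and contribute nothing. and the advantage only sees the reward, so a lucky guess gets pushed up just as hard as a guess that came from actual reasoning.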

basically the qwen .6B model fundamentally does not know how to reason. it just pretends: it uses thinking tokens to sound like it's reasoning, but ultimately it's guessing

reproducing prime-intellect run

all this got me thinking: how was prime-intellect able to teach a 1.7B model wordle? clearly there is not that big a difference between 1.7B and .6B, right?

so I decided to reproduce their training run, first I reproduced their SFT

two things stood out

  1. their batch size of 64 was much larger than mine, i generally used 8

  2. they did not use lora, they did a full fine-tune

the third thing that stood out while doing inference/basic-eval was how good the 1.7B models were at thinking compared to the .6B ones, just after SFT

let me show you by an example

qwen 1.7b

Okay, let's see. I need to figure out the next word to guess in Wordle. The user provided the feedback from the first guess, which was [crane], and the feedback was B B Y B Y. Let me break that down.

First, the feedback for each letter in the word is given. Let's look at each position:

For the first guess [crane], the feedback is:

  • Position 1: B (Black)

  • Position 2: B (Black)

  • Position 3: Y (Yellow)

  • Position 4: B (Black)

  • Position 5: Y (Yellow)

Wait, the user said the feedback is B B Y B Y. So that's five letters. Let me check each position:

  1. 'c' in [crane] was B (Black). So the first letter 'c' is not in the secret word.

  2. 'r' in [crane] was B (Black). So 'r' is not in the secret word.

  3. 'a' in [crane] was Y (Yellow). So 'a' is in the word, but not in position 3 (since position 3 is 'n' in the word? Wait, no, the secret word is 5 letters. Wait, the first guess was [crane], which is c r a n e. The feedback is B B Y B Y. So for each position:

Wait, the feedback is for each letter in the guess. So the first letter of the guess is c, and the feedback for that letter is B (Black), meaning c is not in the secret word. The second letter is r, feedback B (Black) again, so r is not in the secret word. The third letter is a, feedback Y (Yellow), meaning a is in the secret word but not in the third position (since the third position in the guess is a, but the feedback says yellow, which means it's present but in another position. Wait, no. Wait, the feedback for each letter is based on their position. Wait, no. The feedback for each letter is given as a combination of G, Y, or B. So for each letter in the guess, the feedback is given. So for the first guess [crane], the feedback is B B Y B Y. So for each position in the guess:

Position 1: c → B (Black) Position 2: r → B (Black) Position 3: a → Y (Yellow) Position 4: n → B (Black) Position 5: e → Y (Yellow)

Wait, but the feedback is for each letter in the guess. So the first guess is c r a n e. The feedback is B B Y B Y. So:

  • The first letter c is not in the secret word (B).

  • The second letter r is not in the secret word (B).

  • The third letter a is in the secret word (Y, so it's present but in a different position).

  • The fourth letter n is not in the secret word (B).

  • The fifth letter e is in the secret word (Y, so it's present but in a different position).

So the secret word has 'a' and 'e' as letters, but not in the third and fifth positions (since the third position in the guess is a, and the fifth is e). Wait, no. Wait, the feedback for each letter is based on their position in the guess. For example, if the third letter is a and the feedback is Y, that means the a is present in the secret word, but not in the third position. Wait, no. Wait, the Y (Yellow) feedback means that the letter is present in the word but in a different position. So if the third letter in the guess is a, and the feedback is Y, that means that a is in the secret word but not in the third position. Wait, but the feedback is for the entire word. So the Y for a in the third position would mean that a is in the secret word but not in the third position. Wait, no. Let me think again.

The feedback for each letter in the guess is:

  • If the letter is correct (G), then it's a green.

  • If the letter is present in the word but in a different position (Y), then it's a yellow.

  • If the letter is not in the word (B), then it's black.

So for the first guess [crane], the feedback is B B Y B Y. So:

  • The first letter c: B (not in the word)

  • Second letter r: B (not in the word)

  • Third letter a: Y (present in the word, but not in the third position)

  • Fourth letter n: B (not in the word)

  • Fifth letter e: Y (present in the word, but not in the fifth position)

Wait, but the third letter is a, and the feedback is Y. That means that a is in the secret word, but not in the third position. So the secret word has a, but not in position 3. Similarly, e is in the secret word but not in position 5.

So the secret word has a and e, but not in positions 1, 2, 4, or 5? Wait, no. Wait, the secret word has a in position 3 (since the third letter in the guess is a, and the feedback is Y, meaning that a is present in the word but not in position 3. Wait, that can't be. Because if the secret word has a, then in the third position, but the feedback is Y, meaning that the a is in the word but not in the third position. So the a is in the word, but not in position 3. Therefore, the secret word has a in another position (like position 1, 2, 4, or 5). But the first two letters of the guess are c and r, which are not in the word. The fourth letter n is not in the word. So the secret word has a and e, but not in positions 1, 2, 4, or 5? Wait, no. The secret word has a and e, but not in the third position (since the third position in the guess is a, and the feedback is Y, meaning that a is present in the word but not in the third position). Wait, that's a bit confusing. Let me try to think of the secret word.

Let me think of the secret word as a 5-letter word. The letters present are a and e (since a is in the third position of the guess, but the feedback is Y, so a is in the word but not in position 3. Wait, no. Wait, the third position of the guess is a, and the feedback is Y. So that means that the letter a is in the secret word, but not in the third position. So the a is in another position. Similarly, the fifth letter of the guess is e, and the feedback is Y. So e is in the secret word but not in the fifth position.

So the secret word has a and e, but not in positions 3 and 5. So possible positions for a: 1, 2, 4. Possible positions for e: 1, 2, 4.

Also, the first two letters of the guess (c and r) are not in the word. The fourth letter n is not in the word. So the secret word has letters a, e, and possibly other letters not in positions 1, 2, 4, or 5 (but that's not possible, since the secret word has 5 letters). Wait, the secret word has 5 letters. The letters in the secret word are a, e, and possibly others. But the letters in the secret word are:

  • a (in the word, but not in position 3)

  • e (in the word, but not in position 5)

  • Also, the letters in the secret word are c, r, n, but wait, no. Wait, the first two letters of the guess are c and r, which are not in the secret word. The fourth letter is n, which is not in the secret word. So the secret word does not have c, r, n.

So the secret word has a, e, and possibly other letters. Let's think about the letters that are in the secret word. The letters that are in the secret word are a, e, and any other letters not in the first two positions (c, r), fourth position (n), and fifth position (e). Wait, no. The secret word can have other letters. For example, maybe the secret word is something like [a e ...], but let's think.

So the secret word has a and e. Also, the letters in the secret word are:

  • a (but not in position 3)

  • e (but not in position 5)

  • Also, the letters in the secret word are not c, r, n, and possibly other letters.

So the possible letters in the secret word are a, e, and possibly others not in c, r, n, and the letters that are in the secret word but not in the positions mentioned.

Now, the next step is to make a guess. The user wants me to choose the next guess based on the feedback from the first guess.

So the current knowledge is:

  • The secret word has a and e.

  • a is not in position 3.

  • e is not in position 5.

  • c, r, n are not in the secret word.

Possible letters in the secret word: a, e, and possibly others. Let's think of possible letters. For example, the secret word could be something like [a x e y z], but that's not helpful. Let me think of possible words that fit.

Alternatively, think of the possible letters that could be in the secret word. Since the first guess is [crane], and the feedback is B B Y B Y, the secret word must have a and e, but not in positions 3 and 5. Also, the letters in the secret word are not c, r, n.

So possible letters in the secret word could be:

  • a (in position 1, 2, or 4)

  • e (in position 1, 2, 4)

  • Also, other letters not in c, r, n.

Now, the next guess should be a word that uses these letters and avoids the ones that are not in the secret word.

Possible letters that could be in the secret word: a, e, and maybe others like b, d, f, g, h, i, l, m, o, p, s, t, u, v, w, x, y, z, but considering that the secret word is a 5-letter word. Also, considering that the letters in the secret word can't be c, r, n.

So possible letters for the secret word: a, e, and other letters not in c, r, n.

Now, the next guess should be a word that uses these letters and possibly other letters that are not in the feedback. For example, the next guess should not include c, r, n, and avoid letters that are not in the secret word.

But the user hasn't given any new feedback yet. Wait, the user said that in turn 2, the next guess is to be made based on the feedback from turn 1. So the user has given the feedback for turn 1, which is B B Y B Y.

So the next guess should be a word that uses the letters that are in the secret word (a, e, and possibly others), and avoid letters that are not in the secret word.

But to make the next guess, I need to consider the possible letters that are in the secret word. For example, if the secret word has a and e, then the next guess could be a word that includes a and e, but not in positions 3 and 5.

Possible words:

  • Maybe [a e ...], but need to think of possible letters.

Alternatively, think of possible words that fit the constraints. For example, the secret word could be "a e ...", but need to think of other letters.

Alternatively, think of the possible letters that are not in the secret word. Since the secret word has a and e, and the first two letters are c and r (not in the word), and the fourth letter is n (not in the word), the secret word can't have c, r, n, and the letters in the secret word are a, e, and possibly others.

So possible letters for the secret word could include b, d, f, g, h, i, l, m, o, p, s, t, u, v, w, x, y, z, but considering that the secret word is a 5-letter word.

But this is getting too broad. Maybe the next guess should be a word that includes a and e, and avoid the letters that are not in the secret word.

Alternatively, think of possible words that have a and e, and are not c, r, n.

For example, the word "a e ...". Let's think of possible words:

  • "a e b l e" (but that's not a real word) Wait, maybe "a e ...". Let me think of possible 5-letter words that have a and e, and don't have c, r, n.

Possible words:

  • "a e e e e" – not a real word.

  • "a e e a e" – no.

  • "a e e m e" – no.

Alternatively, think of words with a and e, and other letters not in c, r, n.

For example, "a e b l e" – no. Maybe "a e d e e" – no.

Alternatively, think of words like "e a d e a" – no.

This is getting complicated. Maybe I should think of the next guess as a word that uses the letters a and e, and avoids c, r, n.

Another approach: the next guess should be a word that has a and e, and possibly other letters that are not in the secret word. But since the secret word has a and e, but not in certain positions, the next guess should be a word that includes a and e, but not in the positions that are not allowed.

Alternatively, think of possible words that have a and e, and are 5 letters. For example:

  • "a e b e a" – no.

  • "a e b e e" – no.

  • "a e m e a" – no.

Alternatively, think of words where a is in position 1, 2, or 4, and e is in position 1, 2, or 4.

For example, "a e b e e" – no. Wait, this is not working.

Maybe the next guess should be "a e ...", but I need to find a 5-letter word that includes a and e, and doesn't include c, r, n.

Another idea: the secret word could be "a e l e d" – but that's not a real word.

Alternatively, think of "a e x e x" – no.

Wait, maybe the secret word is "a e b l e" – but that's not a real word.

This is getting stuck. Maybe I need to consider that the next guess should be a word that uses the letters a and e, and other letters that are not in the secret word. But I need to think of a real word.

Alternatively, think of the possible letters that are not in the secret word. Since the secret word has a and e, and the letters c, r, n are not in the secret word, the next guess can include letters that are not in the secret word. Wait, no. The next guess should be a word that uses the letters that are in the secret word and avoid the ones that are not.

Alternatively, think of the next guess as a word that has a and e, and possibly other letters not in the secret word. But since the secret word is unknown, this is tricky.

Wait, perhaps the next guess should be a word that includes a and e, and has letters that are not in the secret word. But I'm not sure.

Alternatively, think of the next guess as a word that uses the letters a, e, and possibly others, but avoiding the letters that are not in the secret word.

Wait, this is getting too complicated. Maybe I should try to think of possible words that fit the constraints.

Let's consider that the secret word has a and e, and the next guess should be a word that includes a and e, and avoids c, r, n.

Possible words:

  • "a e x e x" – not real.

  • "a e b e x" – no.

  • "a e m e x" – no.

Alternatively, think of words like "a e l e e" – no.

Wait, maybe the secret word is "a e ...", but the next guess could be "a e ...". For example, "a e b e e" – no.

Alternatively, think of "a e l e d" – no.

This is not working. Maybe I need to think of another approach.

Since the first guess was [crane], and the feedback is B B Y B Y, the secret word has a and e, but not in positions 3 and 5. So the secret word could be something like [a x e x y], but that's not a real word.

Alternatively, think of the secret word as "a e b e e" – no.

Alternatively, think of the secret word as "a e l e e" – no.

This is really challenging. Maybe I should look for a word that includes a and e and has other letters not in c, r, n.

Wait, the secret word could be "a e m e a" – no.

Alternatively, "a e m e l" – no.

Alternatively, "a e m e t" – no.

Hmm. Maybe I need to consider that the secret word has a and e, and other letters that are not in the first two positions (c, r), fourth position (n), and fifth position (e). So the letters in the secret word are a, e, and possibly others like b, d, f, g, h, i, l, m, o, p, s, t, u, v, w, x, y, z.

So the next guess could be a word that includes a and e, and possibly other letters.

For example, the word "a e b e e" – no. Or "a e b d e" – no.

Alternatively, the word "a e b d e" – no.

Alternatively, think of a word like "a e b e a" – no.

This is really difficult. Maybe the next guess should be "a e l e e" – no.

Alternatively, maybe the secret word is "a e l e d" – no.

Wait, maybe the secret word is "a e l e e" – no.

Alternatively, think of the next guess as "a e l e e" – no.

This is not working. Maybe I need to think of a different approach.

Since the first guess [crane] has feedback B B Y B Y, the secret word has a and e, but not in positions 3 and 5. So the secret word could be something like [a x e y z], but I need to think of a real word.

Alternatively, think of the secret word as "a e x y z" – no.

Alternatively, think of the secret word as "a e x e x" – no.

This is really tough. Maybe I need to try a different approach. The user is asking for the next guess, so I need to choose a word that fits the constraints.

Let me try to think of possible 5-letter words that have a and e, and don't include c, r, n.

Possible words:

  • "a e x e x" – no.

  • "a e x e y" – no.

  • "a e x e m" – no.

  • "a e x e l" – no.

  • "a e x e t" – no.

  • "a e x e i" – no.

  • "a e x e o" – no.

  • "a e x e u" – no.

  • "a e x e s" – no.

Wait, maybe the secret word is "a e b e e" – no.

Alternatively, maybe the secret word is "a e b e a" – no.

This is not working. Maybe I need to consider that the secret word has a and e, and the next guess should be a word that includes a and e, and possibly other letters.

For example, the word "a e b e e" – no.

Alternatively, the word "a e b e d" – no.

Alternatively, the word "a e b e m" – no.

Hmm. Maybe I need to think of the next guess as "a e ...", but with a different letter.

Wait, maybe the next guess is "a e l e e" – no.

Alternatively, the next guess could be "a e m e e" – no.

This is really challenging. Maybe I should consider that the next guess is "a e l e e" – no.

Alternatively, think of a word like "a e l e e" – no.

Wait, maybe the secret word is "a e l e e" – but that's not a real word.

Alternatively, think of the word "a e l e m" – no.

I think I'm stuck. Maybe I should try a different strategy. Since the secret word has a and e, and the first two letters of the guess are c and r (not in the word), the next guess should avoid those letters. So the next guess should not include c, r, or n.

So the next guess can't have c, r, or n. So the next guess could be a word like "a e l e e" – no.

Alternatively, think of a word like "a e l e m" – no.

Wait, maybe the next guess is "a e l e e" – but that's not a real word.

Alternatively, the next guess could be "a e l e x" – no.

This is really difficult. Maybe the next guess should be "a e l e e" – but I don't know. Alternatively, think of the word "a e l e e" as "a e l e e" – no.

Wait, maybe the secret word is "a e l e e" – but that's not a real word.

I think I need to make a guess based on the available letters. Since the secret word has a and e, and the next guess should not include c, r, n, maybe the next guess is "a e l e e" – but I'm not sure. Alternatively, maybe the next guess is "a e l e x" – no.

Alternatively, think of the word "a e l e m" – no.

Wait, maybe the next guess is "a e l e e" – but I'm not sure. Alternatively, think of the word "a e l e e" as a possible guess.

But since I'm stuck, maybe I should make a guess like [a e l e e], even if it's not a real word. But that's not helpful. Alternatively, think of the word "a e l e e" as "a e l e e" – no.

Alternatively, think of the word "a e l e e" as "a e l e e" – no.

This is really tough. Maybe I should consider that the secret word has a and e, and the next guess could be "a e l e e" – but I'm not sure. Alternatively, think of the word "a e l e e" as a possible guess.

But I need to choose a word that fits the constraints. Maybe the next guess is "a e l e e" – even if it's not a real word.

Alternatively, think of the word "a e l e e" – no.

Wait, maybe the next guess is "a e l e e" – but I'm not sure. Alternatively, think of the word "a e l e e" as a possible guess.

But since this is a competitive game, the next guess should be a real word. Maybe the word is "a e l e e" – but that's not a real word. Alternatively, think of "a e l e e" as "a e l e e" – no.

I think I need to make a guess. Based on the constraints, the next guess could be [a e l e e], even if it's not a real word. But I'm not sure. Alternatively, think of a different word like [a e l e m] – no.

Wait, maybe the secret word is "a e l e e" – but that's not a real word. So I'm stuck.

Maybe I should try a different approach. Since the feedback for the first guess is B B Y B Y, and the secret word has a and e, the next guess should be a word that includes a and e, and avoid c, r, n. So possible letters are a, e, and others like b, d, f, g, h, i, l, m, o, p, s, t, u, v, w, x, y, z.

So possible words could be "a e b d e" – no. Or "a e b f e" – no.

Alternatively, the word "a e b f e" – no.

Alternatively, the word "a e b d e" – no.

This is really challenging. Maybe the next guess is "a e b d e" – no. But I'm not sure.

Alternatively, think of the word "a e b f e" – no.

I think I need to make a guess here. Based on the constraints, the next guess could be [a e b e e] – but I'm not sure. Alternatively, think of a word like [a e l e e] – but that's not a real word. Maybe the correct answer is [a e l e e], but I'm not sure.

Wait, maybe the next guess is [a e l e e], even if it's not a real word. But I need to choose a real word. Alternatively, think of the word "a e l e e" as "a e l e e" – no.

I think I need to make a guess based on the available letters. Maybe the next guess is [a e l e e] – but I'm not sure. Alternatively, think of the word "a e l e e" as a possible guess.

[a e l e e]

qwen 0.6b

Okay, let's see. I need to figure out the 5-letter word based on the feedback from the first three turns. Let me start by looking at the feedback for each guess.

First, in Turn 1, the word was [crane], and the feedback was B B Y B Y. That means the word has a Y in position 1 (the second letter), a B in position 3 (third letter), and a Y in position 4 (fourth letter). The first letter (crane) is not correct, so it's B. The second letter is correct but not in the right position (since it's Y), and the third letter is correct but in the wrong position (B), and so on.

Then Turn 2, [alert], with feedback G B Y B B. So the correct word has a Y in position 3, but in the first position (a), it's correct. The second letter (e) is correct and in the right position (Y). The third letter (l) is correct. The fourth letter (t) is correct. The fifth letter (r) is correct. Wait, that's a lot of correct letters. But the actual word might have some letters that are correct but in different positions.

Wait, but the feedback for Turn 2 is G B Y B B. Let me parse that. Each position in the word is checked. The feedback says each letter is G, Y, B. So for each letter in the guess, if it's correct, it's G. If it's in the right position, Y. Otherwise, B.

So for Turn 1, [crane], feedback B B Y B Y. Let me think about the letters. The correct word has letters in positions 1-5. The feedback for each position:

Position 1: B (so letter is not correct) Position 2: B (not correct) Position 3: Y (correct letter in the word, but in the wrong position) Position 4: B (not correct) Position 5: Y (correct letter in the word, in position 5)

So the word must have a Y in position 3 and position 5. Also, the first two letters are not correct. The letters in positions 1 and 2 are B, so they are not part of the correct word. The third letter is Y, which is correct but in the wrong position. The fourth and fifth letters are B, meaning they are not part of the correct word.

So the correct word has letters in positions 3 and 5 as Y. Let me check Turn 3, [amaze], feedback G B G B Y. Let's see what that means. The feedback for each position:

Position 1: G (correct letter) Position 2: B (not correct) Position 3: G (correct) Position 4: B (not correct) Position 5: Y (correct)

So the correct word has letters in position 1 and 3 as Y. So combining this information.

So far, the correct letters are position 3 and 5. But we need to figure out the other letters. Let's check the feedback again.

Turn 1: feedback B B Y B Y. So positions 1,2,4 are B, meaning those letters are not in the word. Position 3 is Y, position 5 is Y. So the correct word has letters in positions 3 and 5 as Y. The other letters (positions 1, 2, 4) are not part of the word.

Turn 2: [alert], feedback G B Y B B. So positions 1 is G (correct), position 3 is Y (correct), position 5 is Y (correct). So the correct word has a Y in position 3 and 5. But also, position 1 is correct. So the correct word has letters in position 1 and 3 and 5 as Y. Let me check if there's any overlap.

Wait, but position 3 and 5 are both Y. So the correct word has Y in position 3 and 5. Also, position 1 is correct (G). So that's two correct letters. Let me check Turn 3, [amaze], feedback G B G B Y. So position 1 is G (correct), position 3 is G (correct), position 5 is Y (correct). So again, position 1, 3, 5 are correct. So combining this, the correct word has Y in positions 3 and 5. Position 1 is correct. So the word has letters in positions 1, 3, and 5 as Y. But since the word is 5 letters, that would mean that the correct word has Y in positions 3, 5, and one more position. Wait, but position 1 is also correct. So maybe the correct word has letters Y in positions 1, 3, and 5. But since each position is checked, the letters must be present. Let me think again.

Wait, in Turn 3, the feedback for position 1 is G, meaning that the first letter is correct. So the correct word has Y in position 1. Then position 3 is G, meaning correct letter. Position 5 is Y. So that would mean that the correct word has letters Y in position 1, 3, 5. But then, the other letters (positions 2 and 4) are B. So the word would be [Y][B][Y][B][Y], but that would make it YBYBY. But wait, that's only 5 letters. Wait, but in Turn 2, the feedback for position 1 is G, meaning correct letter. So if the word is [Y][B][Y][B][Y], then the correct word would be YBYBY. But that would mean that the word is "YBYBY". Let me check if that's possible.

But let's check the feedback from Turn 1. Turn 1's feedback is B B Y B Y. So the first two letters are B, meaning not correct. Position 3 is Y (correct), position 5 is Y (correct). Position 4 is B (not correct). So the correct word has Y in positions 3 and 5. Position 1 and 2 are not correct. Position 4 is not correct. So the correct word has letters Y in positions 3 and 5, and letters not in positions 1, 2, 4.

In Turn 2, [alert], feedback G B Y B B. So position 1 is G (correct), position 3 is Y (correct), position 5 is Y (correct). So that means position 1 is correct, position 3 is correct, and position 5 is correct. But position 1 was already marked as correct in Turn 1. So combining both, the correct word has Y in positions 1, 3, 5. So the word is Y in positions 1, 3, 5. That would make the word [Y][B][Y][B][Y]. But that's YBYBY. Let me check if that's possible.

But wait, let's check Turn 3, [amaze], feedback G B G B Y. So position 1 is G (correct), position 3 is G (correct), position 5 is Y (correct). So that would mean position 1, 3, 5 are correct. So combining all, the correct word has letters Y in positions 1, 3, 5. But then, what about the other letters? Position 2 and 4 are B, meaning those letters are not part of the word. So the correct word is [Y][B][Y][B][Y]. But that would mean that the correct word is "YBYBY". But that seems unlikely because in Turn 3, the feedback for position 1 is G, which would mean that the first letter is correct. But if the first letter is Y, then the word would be Y in position 1. Let's check if that's possible.

Alternatively, maybe there's another correct letter. Let me check again. Wait, in Turn 1, the feedback is B B Y B Y. So positions 1,2,4 are B. So those letters are not in the word. In Turn 2, the feedback for position 1 is G, meaning correct. So that's a correct letter. Position 3 is Y, correct. Position 5 is Y, correct. So that's three correct letters. So the correct word must have Y in positions 1,3,5. So the word is [Y][B][Y][B][Y]. But then, in Turn 3, the feedback for position 1 is G (correct), position 3 is G (correct), position 5 is Y (correct). So that would mean that the first letter is correct. But then, what about position 2? It's B, so that letter is not in the word. Position 4 is B, same thing. So the word is [Y][B][Y][B][Y]. But that would be YBYBY. Let me check if that's possible. However, the feedback for Turn 1 and Turn 3 both show Y in positions 3 and 5. So the correct word must have those letters. But then, in Turn 2, the feedback for position 1 is G, which would mean that the first letter is correct. So that's another Y. So combining all, the correct word is [Y][B][Y][B][Y]. But that's only 5 letters. Wait, but in Turn 3, the feedback for position 1 is G, which would mean that the first letter is correct. But if the first letter is Y, then the word is Y in position 1. So the correct word is YBYBY. Let me check if that makes sense.

But let's think again. The feedback from Turn 3 shows position 1 as G, which means the first letter is correct. So the correct word has Y in position 1. Then, position 3 is G (correct), position 5 is Y (correct). So that's two more Y's. So the correct word has three Y's. But the word is 5 letters, so the other two letters are B. So the word is [Y][B][Y][B][Y]. So that would be YBYBY. But wait, that's 5 letters. Let me check if that's possible. For example, the word could be "YBYBY". Let's see: Y in position 1, B in position 2, Y in 3, B in 4, Y in 5. But then, in Turn 3, [amaze], the feedback would be G B G B Y. Let's check: position 1 is G (correct), position 3 is G (correct), position 5 is Y (correct). So yes, that would be correct. But then, what about the other letters? Position 2 is B, so the second letter is not correct. But in Turn 2, [alert], the feedback is G B Y B B. Position 1 is G (correct), position 3 is Y (correct), position 5 is Y (correct). So that's also correct. So combining all, the correct word is [Y][B][Y][B][Y]. But that's 5 letters. But that seems like a possibility. But maybe there's another correct letter. Let me check again.

Alternatively, perhaps the correct word has Y in positions 3 and 5, but also another correct letter. Wait, in Turn 1, feedback for position 3 is Y, position 5 is Y. In Turn 2, position 3 is Y. In Turn 3, position 3 is G. So that suggests that the correct word has Y in position 3 and 5, and position 1 is correct. So three Y's. So the word is YBYBY. But let me check if that's possible. For example, the word could be "YBYBY". Let's check if that's correct. Let's see:

Word: YBYBY

Turn 1: [crane] feedback B B Y B Y. Let's check each position:

Position 1: C vs Y (B), so correct? No.

Position 2: R vs B (B), correct? No.

Position 3: A vs Y (Y), correct.

Position 4: N vs B (B), correct? No.

Position 5: E vs Y (Y), correct.

So feedback would be B B Y B Y. Which matches Turn 1's feedback. Great.

Turn 2: [alert] feedback G B Y B B. Let's check:

Position 1: A vs Y (G), correct.

Position 2: L vs B (B), correct? No, because it's supposed to be B. Wait, but the feedback is G B Y B B. Wait, position 2 is B, meaning that the second letter is correct but in the wrong position. So if the word is [alert], then position 2 is L. But the feedback for position 2 is B. Which suggests that L is correct. But in the correct word, position 2 is B. So this is conflicting. Wait, this suggests that my earlier conclusion is wrong. Because according to Turn 2, position 2 is B, which would mean that the second letter is not in the correct word. But according to Turn 3, the feedback for position 2 is B. But in Turn 2's feedback, the second letter is B. So the second letter is not in the word. Therefore, if the correct word is YBYBY, then in Turn 2, the guess [alert], the second letter (L) is not in the word, which matches the feedback. But in Turn 3, the feedback for position 2 is B, so that's correct. So that works. So the correct word is [Y][B][Y][B][Y]. But let's check Turn 3 again. [amaze], feedback G B G B Y. Let's check each position:

Position 1: A vs Y (G), correct.

Position 2: M vs B (B), correct? No, because it's supposed to be B. Wait, but the feedback is G B G B Y. So position 2 is B. That means that the second letter is correct but in the wrong position. But in the correct word, position 2 is B. So if the word is [Y][B][Y][B][Y], then the second letter is B. So the feedback for position 2 would be B, which matches. But then position 3 is G (correct), position 5 is Y. So that's correct. So [amaze] would have G B G B Y. Which matches Turn 3. So that works. So the correct word is [Y][B][Y][B][Y], which is YBYBY. But wait, that's only 5 letters. Let me check if that's possible. But then, what about the feedback from Turn 2? Let's see: [alert], feedback G B Y B B. So the correct word has Y in position 3 and 5. But position 1 is correct. So yes, that's correct. Therefore, the correct word is YBYBY.

But let me check if there's another possible word. For example, maybe the correct word has Y in position 3 and 5, and another correct letter. But according to the feedback, position 1 is correct (G), position 3 is correct (Y), position 5 is correct (Y). So that's three correct letters. But the word is 5 letters, so the other letters are B. So the word is [Y][B][Y][B][Y]. So that's the answer. Therefore, my next guess should be [Y][B][Y][B][Y], which is YBYBY. But wait, but the user's instruction says to make a guess based on the feedback. So the next guess would be [Y][B][Y][B][Y], which is YBYBY. Let me confirm once more.

Yes, that makes sense. So the correct word is YBYBY.

[Y][B][Y][B][Y]
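the trace above keeps misreading the feedback codes: it treats Y and B as letters of the answer and ends up guessing "YBYBY". for contrast, here is a minimal sketch of how Wordle feedback is usually computed, assuming the common convention that G means right letter in the right position, Y means the letter is in the word but in another position, and B means the letter is not in the word. `score_guess` is a hypothetical helper I wrote for illustration, not the scorer from the actual environment, which may differ in details.

```python
from collections import Counter


def score_guess(guess: str, secret: str) -> str:
    """Return Wordle-style feedback such as 'B Y G B G'.

    Assumed convention (not taken from the real environment):
    G = right letter, right position
    Y = letter is in the word, but in another position
    B = letter is not in the word (or all its copies are already used)
    """
    guess, secret = guess.lower(), secret.lower()
    feedback = ["B"] * len(guess)

    # Secret letters that were not matched exactly; yellows draw from this pool.
    remaining = Counter(s for g, s in zip(guess, secret) if g != s)

    # First pass: exact matches become G.
    for i, (g, s) in enumerate(zip(guess, secret)):
        if g == s:
            feedback[i] = "G"

    # Second pass: leftover letters become Y while copies remain in the pool.
    for i, g in enumerate(guess):
        if feedback[i] == "B" and remaining[g] > 0:
            feedback[i] = "Y"
            remaining[g] -= 1

    return " ".join(feedback)


print(score_guess("crane", "stare"))  # B Y G B G
print(score_guess("amaze", "stare"))  # B B G B G
```

the second example shows why the two passes matter: the first "a" in "amaze" gets B, because the only "a" in "stare" is already used up by the green in position 3.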

and then I noticed that prime-intellect used a batch size of 1024 and trained for 200 steps. what I had used in my earlier qwen 0.6b runs was a batch size of 8 and 500 steps. counting batch size times steps, that is about 204,800 samples against my 4,000, roughly 51x more, so due to compute limits I decided not to reproduce it

most important learning

rent a faster gpu and use cursor

iteration time should not be the thing that slows you down. the time to run one experiment should come down to a day or two.


so that was all for this time. if I get time, I will reproduce the prime-intellect run later.


edit 1

I read some papers on post-training, such as DeepSeekMath, GLM Air, Kimi K2 and Kimi 2.5

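what I learnt was that RL is good at improving maj@k but not pass@k. what that means is that RL mostly makes the model more consistent at producing answers it can already sample some of the time. so if your sft model has accuracy zero, doing RL on it will do nothing, because there is no correct sample for RL to reinforce.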
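to make the two metrics concrete, here is a minimal sketch of pass@k and maj@k for a single problem, computed over the first k sampled answers. the function names and the sample lists are made up for illustration, they are not numbers from my runs or from the papers.

```python
from collections import Counter


def pass_at_k(samples: list[str], answer: str, k: int) -> bool:
    """True if at least one of the first k sampled answers is correct."""
    return answer in samples[:k]


def maj_at_k(samples: list[str], answer: str, k: int) -> bool:
    """True if the most frequent answer among the first k samples is correct."""
    top_answer, _count = Counter(samples[:k]).most_common(1)[0]
    return top_answer == answer


answer = "stare"

# Hypothetical samples: the right answer shows up, but it is not the majority.
before_rl = ["crane", "slate", "stare", "slate", "plate", "slate", "grate", "stare"]
# Hypothetical samples: same answers, with probability mass moved onto the right one.
after_rl = ["stare", "stare", "slate", "stare", "stare", "plate", "stare", "stare"]
# Hypothetical samples: the right answer is never produced at all.
never_right = ["crane", "slate", "plate", "slate", "grate", "crane", "slate", "plate"]

for name, samples in [("before_rl", before_rl), ("after_rl", after_rl), ("never_right", never_right)]:
    print(name, "pass@8:", pass_at_k(samples, answer, 8), "maj@8:", maj_at_k(samples, answer, 8))
```

in this toy setup pass@8 is already true before RL and only maj@8 flips afterwards, which is the "better maj@k, same pass@k" picture. the `never_right` case is where my sft model was: both metrics are false and there is nothing to push probability mass towards.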

which was true for my sft qwen 0.6b, so I decided to check how I could bring the sft base model up to some accuracy, around 0.2 to 0.3.

the first thing I tried was running sft for longer, 200 steps with a batch size of 64, but it did not work out: the model still does not show any reasonable thinking traces.

I do not know why that is the case, though. qwen-1.7B starts showing good thinking traces with minimal sft.