Show HN: Beating Pokemon Red with RL and <10M Parameters
drubinstein.github.io

Hi everyone!
After spending hundreds of hours, we're excited to finally share our progress in developing a reinforcement learning system to beat Pokémon Red. Our system successfully completes the game using a policy under 10M parameters, PPO, and a few novel techniques. With the release of Claude Plays Pokémon, now feels like the perfect time to showcase our work.
We'd love to get feedback!
Really cool work. It seems like some critical areas (team rocket, safari zone) rely on encoding game knowledge into the reward function somehow, which "smuggles in" external intelligence about the game. A lot of these are related to planning, which makes me wonder whether you could "bolt on" an LLM to do things like steer the RL agent, dynamically choose what to reward, or even do some of the planning itself. Do you think there's any low-hanging fruit on this front?
For well-known games like "Pokemon Red" I wonder how much of that game knowledge would be "smuggled in" by an LLM via its training data if you just replaced the external info in the reward function with it, or used it to make up for other deficiencies.
I think they allude to this in their conclusion, but it's less about the low-hanging fruit and more about designing a system to feed game dialogue back into the RL decision-making process in a way that can be mutated as part of the RL (be it via an LLM or something else).
Wrote about this in the results section. I think there is a way to mix the two and simplify the rewards in the process. A lot of the magic behind getting the agent to teach and use Cut probably could have been handled by an LLM.
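Purely as a sketch of what I mean by mixing the two (not what we actually shipped): dialogue the game shows could be scored by an LLM and folded into the shaped reward. query_llm below is a hypothetical stand-in for whatever client you'd use.

    def query_llm(prompt: str) -> float:
        """Hypothetical stand-in for an LLM call; should return a 0-1 relevance score."""
        return 0.0  # wire an actual client in here

    def shaped_reward(base_reward: float, dialogue: str, goal: str) -> float:
        """Blend the environment reward with an LLM judgment of on-screen dialogue."""
        prompt = (
            f"Current goal: {goal}\n"
            f"Game dialogue just shown: {dialogue}\n"
            "On a 0-1 scale, how relevant is this dialogue to the goal?"
        )
        return base_reward + 0.1 * query_llm(prompt)  # keep the bonus small so PPO still leads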
Note: What makes this interesting is that it's a pre-LLM project, which shows that for some problems you don't need an LLM at all. A plain old reinforcement learning algorithm and a deep neural network are a perfect fit.
This is what I want to see more of and goes against the hype of LLMs. What a great RL project.
Meanwhile, "Claude" is still stuck somewhere in the game. Imagine the costs of running that vs this project.
Claude 3.7 recently failed to finish Pokemon after getting stuck in a corner and deciding it was impossible to get out
not our agents! a hierarchical approach would be superior. add rl to claude and it's gg
Wow nice work. 10M is a tiny model and I suspect this might be the future for specialised work. I can also imagine the progress towards AGI/ASI to have smaller models used as submodules.
brains basically have “modules” like this as well - neuronal columns that handle specialised tasks. For example when you’re driving on the road, the understanding whether the distance between you and the vehicle in front is increasing or decreasing is a finely tuned function of a specialised part of the brain.
Please stream the gameplay to twitch so people can compare.
We have a shared community map where you can watch hundreds of agents from multiple people's training runs playing in real time!
https://pwhiddy.github.io/pokerl-map-viz/
That's amazing. Really awesome work.
Can you make a twitch stream of a single agent playing?
Wouldn't make much sense. We generally train with 288 environments simultaneously. I've been thinking about ways to nicely stream all 288 environments though.
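For a sense of what that looks like, here's a minimal sketch of running hundreds of environments in parallel using plain Gymnasium's vector API (we actually use PufferLib for this; PokemonRedEnv is just a hypothetical wrapper name):

    import gymnasium as gym

    NUM_ENVS = 288  # the number quoted above

    def make_env():
        from pokemon_red_env import PokemonRedEnv  # hypothetical module
        return PokemonRedEnv()

    envs = gym.vector.AsyncVectorEnv([make_env for _ in range(NUM_ENVS)])
    obs, info = envs.reset()
    for _ in range(1000):
        actions = envs.action_space.sample()  # random-policy placeholder
        obs, rewards, terms, truncs, infos = envs.step(actions)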
Incredible work. I am just learning about PyBoy from your project, and it made me think of many fun ways to use that library to play Pokemon autonomously.
Very good to hear. Join the pyboy/pokemon discords! https://discord.gg/UXpjQTgs https://discord.gg/EVS3tAGm
What an awesome project! I'm curious - I would have thought that rewarding unique coordinates would be enough to get the agent to (eventually) explore all areas, including the key ones. What did the agents end up doing before key areas got an extra reward?
(and how on earth did you port Pokémon red to a RL environment? O.o)
The environments wouldn't concentrate enough in the Rocket Hideout beneath the Celadon Game Corner. The agent would have the player wander the world, reward hacking. With wild battles enabled, the environments would end up in Lavender Tower fighting Gastly.
> (and how on earth did you port Pokémon red to a RL environment? O.o)
Read and find out :)
Thanks haha, I kept reading =D I see, so it's not just that you have to visit the key areas, they need to show up in the episodes enough to provide a signal for training.
Yup!
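If you want to see the shape of it, the unique-coordinate reward can be sketched in a few lines. The bonus value and the assumption that (map_id, x, y) is readable from RAM are placeholders here, not our actual settings:

    class CoordinateReward:
        """Pays a one-time bonus for every never-before-visited tile."""

        def __init__(self, bonus: float = 0.01):
            self.seen: set[tuple[int, int, int]] = set()
            self.bonus = bonus

        def __call__(self, map_id: int, x: int, y: int) -> float:
            pos = (map_id, x, y)
            if pos in self.seen:
                return 0.0
            self.seen.add(pos)
            return self.bonus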
you don't port it, you wrap it. you can put anything in an rl environment. usually emulators are done with bizhawk and some lua. worst case there's ffi or screen capture.
Right, my thought was that this would be way too slow for episode rollout (versus an accelerated implementation in jax or something), but I guess not!
well that's the golden issue with rl: sample efficiency. it's env-bounded, so you want an architecture that extracts the maximum possible information from each collected sample, avoiding catastrophic forgetting and prioritizing samples according to relevance
My first version of this project 5 years ago involved a python-lua named pipe using Bizhawk actually. No clue where that code went
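To make "wrap, don't port" concrete, here's a rough sketch of a Gymnasium wrapper around PyBoy. It uses PyBoy 2.x-style calls (button, tick, screen.ndarray); the ROM path, frame count, and empty reward are placeholders rather than the project's real setup:

    import gymnasium as gym
    import numpy as np
    from pyboy import PyBoy

    ACTIONS = ["a", "b", "up", "down", "left", "right", "start"]

    class PokeRedEnv(gym.Env):
        def __init__(self, rom_path: str = "pokered.gb"):  # placeholder path
            self.pyboy = PyBoy(rom_path, window="null")    # headless for speed
            self.action_space = gym.spaces.Discrete(len(ACTIONS))
            self.observation_space = gym.spaces.Box(0, 255, (144, 160, 4), np.uint8)

        def _obs(self):
            return np.asarray(self.pyboy.screen.ndarray)  # raw RGBA screen

        def reset(self, seed=None, options=None):
            super().reset(seed=seed)
            return self._obs(), {}

        def step(self, action):
            self.pyboy.button(ACTIONS[action])  # press-and-release the button
            self.pyboy.tick(24)                 # advance frames so the input lands
            reward = 0.0                        # plug a shaping function in here
            return self._obs(), reward, False, False, {}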
Can't Pokemon be beaten by almost random play?
Judging by the "pi plays Pokemon Sapphire" stream, uh, not in a reasonable amount of time? It's been at it for over 3 years, hasn't gotten a gym badge yet, and mostly stays in the starting town.
It's impossible to beat with random actions or brute force, but you can get surprisingly far. It doesn't take too long to get halfway through Route 1, but even with insane compute you'll never even make it to Viridian Forest.
It can be brute-forced, if that's what you mean. It has a fairly low difficulty curve, and these old games have a grid system for movement and action selection. That's why they're pointing out the low parameter count and CPU. The point I took away is doing more with less.
It definitely cannot be beaten using random inputs. It doesn't even get out of Pallet Town after billions of steps. We tested...
the game has been beaten by fish
Based on the other examples of random inputs not being sufficient, I dare say the fish-based attempt may have been fraudulent.
dyor. we only tested it with a pufferfish, courtesy of the puffer.ai / pufferlib RL library. i promise it doesn't work with random inputs.
I'm not sure if you're just making a play on words, but I believe the commenter was talking about the streamer who set up their fishtank to map to inputs and then lets their fish "play games". They beat Pokemon Sapphire, supposedly. https://www.polygon.com/2020/11/9/21556590/fish-pokemon-sapp...
The win condition of the game is just the entire game state configured in a certain way, so there exist a lot of winning states; you just have to do a search.
not sure what you mean... details?
Are there any uses for AI yet that aren't either:

1. Doing things humans do for fun.

2. Doing things that AI is horribly terrible at.

?
There's a ton of applications for AI. Back when I was at Spotify, I co-authored Basic Pitch (https://basicpitch.spotify.com/), an audio-to-midi library. There are a ton of uses for AI outside of what's heavily publicized.
Medical field, spotting things
Autonomous drones
Financial fraud detection
Scheduling of trains/buses/etc
I personally do like chatbots but you probably don't
the only chatbot for me is smarterchild
I feel like that sentence aged me.
Thank you. Because I was just shaking my head "kids these days"
Awesome! Why do you think the reward for reading signs helped? I'm assuming the model doesn't gain the ability to read and understand english just from RL, so what purpose does it serve other than to maybe waste ticks on signs that ultimately don't need to be read?
It's silly, but signs were a way to incentivize the agent to explore deeper into the Safari Zone among other areas.
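A minimal sketch of that kind of sign bonus, assuming the tile the player is facing and a textbox-open flag can be read from game RAM (the values below are made up for illustration, not our real ones):

    SEEN_SIGNS: set[tuple[int, int, int]] = set()

    def sign_reward(map_id: int, x: int, y: int, textbox_open: bool) -> float:
        """One-time bonus per sign read, keyed on the tile being faced."""
        key = (map_id, x, y)
        if textbox_open and key not in SEEN_SIGNS:
            SEEN_SIGNS.add(key)
            return 0.005  # nudges agents into sign-dense areas like the Safari Zone
        return 0.0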
Very nice! Nice to see demonstrations of reinforcement learning being used to solve non-trivial tasks.
Really missing the arxiv link. The whole page reads like the arxiv link should be in the next paragraph, but it never appeared.
This is very cool, congrats!
I wonder, does anyone have a sense of the approximate raw number of button presses required to beat the game? Mostly curious to see how that compares to the parameter count.
I imagine < 10000. https://github.com/KeeyanGhoreshi/PokemonFireredSingleSequen... and https://www.youtube.com/watch?v=6gjsAA_5Agk. I believe this is something like 200k and is a slightly different game. Quite a bit less than 10m either way.
Heads up, clicking "Next Page" just takes you to an empty screen; you have to use the navigation links on the left if you want to read past the first screen.
Thanks for the heads up. I just pushed a fix.
I think you fixed the one below the puffer.ai image, but not the one above Authors.
and...fixed!
i am sorry for my awful qa on the site :((((((((((((
Ah, very neat.
Maybe some day the “rival” character in Pokemon can be played by a RL system, haha. That way you can have a “real player (simulated)” for your rival.
a cool idea, except that battling actually doesn't even matter to the ai. if you look at what the agent is doing during a battle, it is sort of spamming options + picking damaging attacks. it would be a stretch to say that agents were 'good' at battling...
if you've done the work to make the rival rl-based and able to go around, you'd probably have added basic battle controls
as it stands, battling is wholly unimportant to completing the game, as long as the agents can eventually complete the trainer battles mandatory for plot advancement. it's funny because everyone thinks about battling when they think about pokemon. my first fn i wrote, back when we were still bumping around pallet town, was a battle reward function. it was trash and didn't work and was over-complicated. the crux of the problem is exploration over a vast, open-world map, and completion of the sundry storyline tasks at distal parts of said map in the correct sequence without the policy collapsing and without agents overfitting to, say, overworld loops.
you missed my point.
I know all about rl. I've read Go-Explore 1/2, and I have personally implemented intrinsic curiosity.
I was just commenting on what the other person said, which is that it would be cool to have the npcs be agents that battle and train too, to which you said they could not be made to, to which I say: we have the technology. :)
Sounds cool to me.
This is a first-in-world, isn't it?
Considering how many things are less complicated than Pokemon, this is very cool
> Pokémon Red takes 25 hours on average for a new player to complete.
Seriously? I've never really played video games, but I remember spending so much time on Pokemon Red when I was young. Not sure if I ever really finished it more than once, but I'm pretty sure I must have played for more than 50 hours before even getting close to finishing. My memory might be tricking me, though.
Not sure which Pokemon version it was, but I got so hooked trying to get this "secret" Pokemon which was just a bunch of pixels. Some kind of bug (of the game, not the Pokemon type). You had to do specific things in a park and elsewhere, then surf up and down X times on the right shore of an island... or something like that. I had no idea how it worked and got so hooked; I must have spent most of my playing time on things like that.
Oh boy, memories...
It definitely took me way more than 25 hours as a kid to beat Pokemon Blue! But I was so young that I didn't understand that "Oak: Hello!" meant that someone called Oak was talking.
The glitched Pokemon you're talking about is Missingno by the way! I remember surfing up and down Cinnabar Island to do the same thing.
i had to look up how to do cut. like, i was hard-stuck.
Awesome! Missingno was what i meant. Thank you!
There’s a guy on Youtube named JRose11 who is on a quest to beat Pokemon Red with all 151 of the original Pokemon individually. He’s about 100 Pokemon in at this point. He doesn’t use crazy speedrunning tactics (he wants to approximate a normal-ish playthrough) but because he knows exactly where to go, what to do and what’s skippable almost all of his runs are under 10 hours (many are under 6 and he did it with Mewtwo in just under 2).
The estimates seem to be in line with today's reported numbers, based off HowLongToBeat. Back in the day it was intended to last 60 hours, iirc.
Could you have used the decompilations of pokemon on github? https://github.com/pret/pokered
There's an entire section on how the decompilations were used :)
Ok, sorry. I thought maybe the decomp project could be edited in a way that would create a ROM that made RL easier, but it seems like it just came in handy for looking up values, along with the GB ASM tutorial. The alternative in my thought process was re-creating Pokemon Red in a modern language, which you also mentioned.
if you helped with pret then god bless you