Show HN: Beating Pokemon Red with RL and <10M Parameters
drubinstein.github.io

Hi everyone!
After spending hundreds of hours, we're excited to finally share our progress in developing a reinforcement learning system to beat Pokémon Red. Our system successfully completes the game using a policy under 10M parameters, PPO, and a few novel techniques. With the release of Claude Plays Pokémon, now feels like the perfect time to showcase our work.
We'd love to get feedback!
Really cool work. It seems like some critical areas (team rocket, safari zone) rely on encoding game knowledge into the reward function somehow, which "smuggles in" external intelligence about the game. A lot of these are related to planning, which makes me wonder whether you could "bolt on" an LLM to do things like steer the RL agent, dynamically choose what to reward, or even do some of the planning itself. Do you think there's any low-hanging fruit on this front?
For well-known games like "Pokemon Red" I wonder how much of that game knowledge would be "smuggled in" by an LLM via its training data if you just replaced the external info in the reward function with it, or used it to make up for other deficiencies.
I think they allude to this in their conclusion, but it's less about the low-hanging fruit and more about designing a system to feed game dialogue back into the RL decision-making process in a way that can be mutated as part of the RL (be it via an LLM or something else).
Wrote about this in the results section. I think there is a way to mix the two and simplify the rewards in the process. A lot of the magic behind getting the agent to teach and use Cut probably could have been handled by an LLM.
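Purely as a sketch of what I mean by mixing the two (not what we actually shipped): dialogue the game shows could be scored by an LLM and folded into the shaped reward. query_llm below is a hypothetical stand-in for whatever client you'd use.

    def query_llm(prompt: str) -> float:
        """Hypothetical stand-in for an LLM call; should return a 0-1 relevance score."""
        return 0.0  # wire an actual client in here

    def shaped_reward(base_reward: float, dialogue: str, goal: str) -> float:
        """Blend the environment reward with an LLM judgment of on-screen dialogue."""
        prompt = (
            f"Current goal: {goal}\n"
            f"Game dialogue just shown: {dialogue}\n"
            "On a 0-1 scale, how relevant is this dialogue to the goal?"
        )
        return base_reward + 0.1 * query_llm(prompt)  # keep the bonus small so PPO still leads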
Note: What makes this interesting is that it's a pre-LLM project, which shows that for some problems you don't need an LLM at all. A plain old reinforcement learning algorithm and a deep neural network are a perfect fit.
This is what I want to see more of and goes against the hype of LLMs. What a great RL project.
Meanwhile, "Claude" is still stuck somewhere in the game. Imagine the costs of running that vs this project.
Claude 3.7 recently failed to finish Pokemon after getting stuck in a corner and deciding it was impossible to get out
not our agents! a hierarchical approach would be superior. add rl to claude and it's gg
Wow nice work. 10M is a tiny model and I suspect this might be the future for specialised work. I can also imagine the progress towards AGI/ASI to have smaller models used as submodules.
brains basically have “modules” like this as well - neuronal columns that handle specialised tasks. For example when you’re driving on the road, the understanding whether the distance between you and the vehicle in front is increasing or decreasing is a finely tuned function of a specialised part of the brain.
Please stream the gameplay to twitch so people can compare.
We have a shared community map where you can watch hundreds of agents from multiple people's training runs playing in real time!
https://pwhiddy.github.io/pokerl-map-viz/
That's amazing. Really awesome work.
Can you make a twitch stream of a single agent playing?
Wouldn't make much sense. We generally train with 288 environments simultaneously. I've been thinking about ways to nicely stream all 288 environments though.
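For a sense of what that looks like, here's a minimal sketch of running hundreds of environments in parallel using plain Gymnasium's vector API (we actually use PufferLib for this; PokemonRedEnv is just a hypothetical wrapper name):

    import gymnasium as gym

    NUM_ENVS = 288  # the number quoted above

    def make_env():
        from pokemon_red_env import PokemonRedEnv  # hypothetical module
        return PokemonRedEnv()

    envs = gym.vector.AsyncVectorEnv([make_env for _ in range(NUM_ENVS)])
    obs, info = envs.reset()
    for _ in range(1000):
        actions = envs.action_space.sample()  # random-policy placeholder
        obs, rewards, terms, truncs, infos = envs.step(actions)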
Incredible work. I am just learning about PyBoy from your project, and it made me think of many fun ways to use that library to play Pokemon autonomously.
Very good to hear. Join the pyboy/pokemon discords! https://discord.gg/UXpjQTgs https://discord.gg/EVS3tAGm
What an awesome project! I'm curious - I would have thought that rewarding unique coordinates would be enough to get the agent to (eventually) explore all areas, including the key ones. What did the agents end up doing before key areas got an extra reward?
(and how on earth did you port Pokémon red to a RL environment? O.o)
The environments wouldn't concentrate enough in the Rocket Hideout beneath the Celadon Game Corner. The agent would have the player wander the world, reward hacking. With wild battles enabled, the environments would end up in Lavender Tower fighting Gastly.
> (and how on earth did you port Pokémon red to a RL environment? O.o)
Read and find out :)
Thanks haha, I kept reading =D I see, so it's not just that you have to visit the key areas, they need to show up in the episodes enough to provide a signal for training.
Yup!
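If you want to see the shape of it, the unique-coordinate reward can be sketched in a few lines. The bonus value and the assumption that (map_id, x, y) is readable from RAM are placeholders here, not our actual settings:

    class CoordinateReward:
        """Pays a one-time bonus for every never-before-visited tile."""

        def __init__(self, bonus: float = 0.01):
            self.seen: set[tuple[int, int, int]] = set()
            self.bonus = bonus

        def __call__(self, map_id: int, x: int, y: int) -> float:
            pos = (map_id, x, y)
            if pos in self.seen:
                return 0.0
            self.seen.add(pos)
            return self.bonus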
you don't port it, you wrap it. you can put anything in an rl environment. usually emulators are done with bizhawk and some lua. worst case there's ffi or screen capture.
Right, my thought was that this would be way too slow for episode rollout (versus an accelerated implementation in jax or something), but I guess not!
well that's the golden issue with rl: sample efficiency. it's env-bounded, so you want an architecture that extracts the maximum possible information from each collected sample, avoiding catastrophic forgetting and prioritizing samples according to relevance
My first version of this project 5 years ago involved a python-lua named pipe using Bizhawk actually. No clue where that code went
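To make "wrap, don't port" concrete, here's a rough sketch of a Gymnasium wrapper around PyBoy. It uses PyBoy 2.x-style calls (button, tick, screen.ndarray); the ROM path, frame count, and empty reward are placeholders rather than the project's real setup:

    import gymnasium as gym
    import numpy as np
    from pyboy import PyBoy

    ACTIONS = ["a", "b", "up", "down", "left", "right", "start"]

    class PokeRedEnv(gym.Env):
        def __init__(self, rom_path: str = "pokered.gb"):  # placeholder path
            self.pyboy = PyBoy(rom_path, window="null")    # headless for speed
            self.action_space = gym.spaces.Discrete(len(ACTIONS))
            self.observation_space = gym.spaces.Box(0, 255, (144, 160, 4), np.uint8)

        def _obs(self):
            return np.asarray(self.pyboy.screen.ndarray)  # raw RGBA screen

        def reset(self, seed=None, options=None):
            super().reset(seed=seed)
            return self._obs(), {}

        def step(self, action):
            self.pyboy.button(ACTIONS[action])  # press-and-release the button
            self.pyboy.tick(24)                 # advance frames so the input lands
            reward = 0.0                        # plug a shaping function in here
            return self._obs(), reward, False, False, {}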
Can't Pokemon be beaten by almost random play?
Judging by the "pi plays Pokemon Sapphire" stream, uh, not in a reasonable amount of time? It's been at it for over 3 years, hasn't gotten a gym badge yet, and mostly stays in the starting town.
It's impossible to beat with random actions or brute force, but you can get surprisingly far. It doesn't take too long to get halfway through Route 1, but even with insane compute you'll never even make it to Viridian Forest.
It can be brute-forced, if that's what you mean. It has a fairly low difficulty curve, and these old games have a grid system for movement and action selection. That's why they're pointing out the low parameter count and CPU. The point I took away is doing more with less.
It definitely cannot be beaten using random inputs. It doesn't even get out of Pallet Town after billions of steps. We tested...
the game has been beaten by fish
Based on the other examples of random inputs not being sufficient, I dare say the fish-based attempt may have been fraudulent.
dyor. we only tested it with a pufferfish, courtesy of the puffer.ai / pufferlib RL library. i promise it doesn't work with random inputs.
I'm not sure if you're just making a play on words, but I believe the commenter was talking about the streamer who set up their fishtank to map to inputs and then lets their fish "play games". They beat Pokemon Sapphire, supposedly. https://www.polygon.com/2020/11/9/21556590/fish-pokemon-sapp...
The win condition of the game is just the entire game state configured in a certain way, so there exist a lot of winning states; you just have to do a search.
not sure what you mean... details?
Are there any uses for AI yet that aren't either:

1. Doing things humans do for fun.

2. Doing things that AI is horribly terrible at.

?
There's a ton of applications for AI. Back when I was at Spotify, I co-authored Basic Pitch (https://basicpitch.spotify.com/), an audio-to-midi library. There are a ton of uses for AI outside of what's heavily publicized.
Medical field, spotting things
Autonomous drones
Financial fraud detection
Scheduling of trains/buses/etc
I personally do like chatbots but you probably don't
the only chatbot for me is smarterchild
I feel like that sentence aged me.
Thank you. Because I was just shaking my head "kids these days"
Awesome! Why do you think the reward for reading signs helped? I'm assuming the model doesn't gain the ability to read and understand english just from RL, so what purpose does it serve other than to maybe waste ticks on signs that ultimately don't need to be read?
It's silly, but signs were a way to incentivize the agent to explore deeper into the Safari Zone among other areas.
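A minimal sketch of that kind of sign bonus, assuming the tile the player is facing and a textbox-open flag can be read from game RAM (the values below are made up for illustration, not our real ones):

    SEEN_SIGNS: set[tuple[int, int, int]] = set()

    def sign_reward(map_id: int, x: int, y: int, textbox_open: bool) -> float:
        """One-time bonus per sign read, keyed on the tile being faced."""
        key = (map_id, x, y)
        if textbox_open and key not in SEEN_SIGNS:
            SEEN_SIGNS.add(key)
            return 0.005  # nudges agents into sign-dense areas like the Safari Zone
        return 0.0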
Very nice! Nice to see demonstrations of reinforcement learning being used to solve non-trivial tasks.
Really missing the arxiv link. The whole page reads like the arxiv link should be in the next paragraph, but it never appeared.
This is very cool, congrats!
I wonder, does anyone have a sense of the approximate raw number of button presses required to beat the game? Mostly curious to see how that compares to the parameter count.
I imagine < 10000. https://github.com/KeeyanGhoreshi/PokemonFireredSingleSequen... and https://www.youtube.com/watch?v=6gjsAA_5Agk. I believe this is something like 200k and is a slightly different game. Quite a bit less than 10m either way.
Heads up, clicking "Next Page" just takes you to an empty screen; you have to use the navigation links on the left if you want to read past the first screen.
Thanks for the heads up. I just pushed a fix.
I think you fixed the one below the puffer.ai image, but not the one above Authors.
and...fixed!
i am sorry for my awful qa on the site :((((((((((((
Ah, very neat.
Maybe some day the “rival” character in Pokemon can be played by a RL system, haha. That way you can have a “real player (simulated)” for your rival.
a cool idea, except that battling actually doesn't even matter to the ai. if you look at what the agent is doing during a battle, it is sort of spamming options + picking damaging attacks. it would be a stretch to say that agents were 'good' at battling...
if you've done the work to make the rival rl-based and able to go around, you'd probably have added basic battle controls
as it stands, battling is wholly unimportant to completing the game, as long as the agents can eventually complete the trainer battles mandatory for plot advancement. it's funny because everyone thinks about battling when they think about pokemon. my first fn i wrote, back when we were still bumping around pallet town, was a battle reward function. it was trash and didn't work and was over-complicated. the crux of the problem is exploration over a vast, open-world map, and completion of the sundry storyline tasks at distal parts of said map in the correct sequence without the policy collapsing and without agents overfitting to, say, overworld loops.
you missed my point.
I know all about rl. I've read Go-Explore 1/2, and I have personally implemented intrinsic curiosity.
I was just commenting on what the other person said, which is that it would be cool to have the npcs be agents that battle and train too, to which you said they could not be made to, to which I say: we have the technology. :)
Sounds cool to me.
This is a first-in-world, isn't it?
Considering how many things are less complicated than Pokemon, this is very cool
> Pokémon Red takes 25 hours on average for a new player to complete.
Seriously? I've never really played video games, but I remember spending so much time on Pokemon Red when I was young. Not sure if I ever really finished it more than once, but I'm pretty sure I must have played for more than 50 hours before even getting close to finishing. My memory might be tricking me, though.
Not sure which Pokemon version it was, but I got so hooked trying to get this "secret" Pokemon which was just a bunch of pixels. Some kind of bug (of the game, not the Pokemon type). You had to do specific things in a park and elsewhere, then surf up and down X times on the right shore of an island... or something like that. I had no idea how it worked and got so hooked; I must have spent most of my playing time on things like that.
Oh boy, memories...
It definitely took me way more than 25 hours as a kid to beat Pokemon Blue! But I was so young that I didn't understand that "Oak: Hello!" meant that someone called Oak was talking.
The glitched Pokemon you're talking about is Missingno by the way! I remember surfing up and down Cinnabar Island to do the same thing.
i had to look up how to do cut. like, i was hard-stuck.
Awesome! Missingno was what i meant. Thank you!
There’s a guy on Youtube named JRose11 who is on a quest to beat Pokemon Red with all 151 of the original Pokemon individually. He’s about 100 Pokemon in at this point. He doesn’t use crazy speedrunning tactics (he wants to approximate a normal-ish playthrough) but because he knows exactly where to go, what to do and what’s skippable almost all of his runs are under 10 hours (many are under 6 and he did it with Mewtwo in just under 2).
The estimates seem to be in line with today's reported numbers, based off HowLongToBeat. Back in the day it was intended to last 60 hours, iirc.
Could you have used the decompilations of pokemon on github? https://github.com/pret/pokered
There's an entire section on how the decompilations were used :)
Ok, sorry. I thought maybe the decomp project could be edited in a way that would create a ROM that made RL easier, but it seems like it just came in handy for looking up values, along with the GB ASM tutorial. The alternative in my thought process was re-creating Pokemon Red in a modern language, which you also mentioned.
if you helped with pret then god bless you