Solving Puzzles Faster Than Humanly Possible
Conventional wisdom suggests that rushed engineering is sloppy engineering. Let’s find some low stakes, and put that theory to the test!
The Opus Magnum 24-Hour Challenge
I’ve blogged many times already about the recreational engineering game that is Opus Magnum. This game will be our testing ground. Specifically, the player known as “panic” has created a program that procedurally generates solvable Opus Magnum custom puzzles in bulk. These are not tournament-quality puzzles with clever gimmicks (a post on the recent tournament is in progress, by the way!) but rather an assortment of things that can be solved. The gif here is an example of one of these puzzles; I solved it without much care for optimization, though I am naturally cycles-brained.
A zip file of 100 of these puzzles is available on panic’s website: http://critelli.technology/24hour-1-sample.zip. The “critelli.technology” domain should be familiar to those who have seen other events in Opus Magnum or used community tooling to merge or mirror their solutions. Also hosted on this domain is a Python library, http://critelli.technology/om.py, that can be used to parse puzzle files and to build and simulate solution files. Then, on two important dates, there will be new puzzle drops.
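If you want to poke at the sample set from a script, something like the following will fetch and unpack it (standard library only; the output folder and the .puzzle extension filter are my own assumptions), at which point om.py can take over for parsing and simulation:

import io
import urllib.request
import zipfile
from pathlib import Path

SAMPLE_URL = "http://critelli.technology/24hour-1-sample.zip"
OUT_DIR = Path("sample_puzzles")  # assumption: any local working folder

def fetch_sample_puzzles() -> list[Path]:
    """Download the 100-puzzle sample zip and extract it locally."""
    with urllib.request.urlopen(SAMPLE_URL) as response:
        archive = zipfile.ZipFile(io.BytesIO(response.read()))
    OUT_DIR.mkdir(exist_ok=True)
    archive.extractall(OUT_DIR)
    # Assumption: the puzzles use the game's usual .puzzle file extension.
    return sorted(OUT_DIR.rglob("*.puzzle"))

if __name__ == "__main__":
    puzzles = fetch_sample_puzzles()
    print(f"Extracted {len(puzzles)} puzzle files to {OUT_DIR}/")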
June 2
On Sunday, June 2, 2024 at 00:00 GMT, panic will release a new set of puzzles, this time 1000 of them, at http://critelli.technology/24hour-1-test.zip. For 24 hours following this drop, you can send zips of solution files to panic at ian@ianhenderson.org. He will use the om library to run and score them, evaluating Cost, Cycles, and Area for each. All of the infrastructure is there to efficiently generate a high volume of data about a potentially large number of submissions.
And it is going to be a large number. Consider: 24 hours for 1000 puzzles means it is not humanly feasible for one person to solve them all by hand.
October 20
The official 24 hour challenge will happen on Sunday, October 20th, 2024 at 00:00 GMT, at http://critelli.technology/24hour-1-puzzles.zip. It will work exactly the same as the June 2 drop, but with more time to prepare.
But, prepare what?
Well, the intention here is to push people to make automated puzzle solvers. The 100 puzzles already given are test fodder for the bots, and the June 2nd date helps test the full end-to-end process from puzzle drop to deadline.
Prior Art
Opus Magnum was released in December 2017. In June 2018, “gtw123” shared the video shown here. In the video, an automated solver builds solutions for Refined Gold, Water Purifier, and a particular custom puzzle containing every atom type in the output.
The source code for gtw123’s solver is found at https://github.com/gtw123/OpusSolver.
One key difference versus the 24 hour challenge is that gtw123’s bot builds its solutions directly in the game. It reads the contents of the inputs and outputs directly from the placed atoms using image recognition, manipulates the UI to place parts and instructions, and stops when the solution is built, expecting the user to then run it.
However, as an intermediate step, it is clearly generating a blueprint for a solution to an already-analyzed puzzle. That step is the only one required in the 24 hour challenge. With all of the infrastructure panic has built, the graphics and UI interaction are no longer part of the problem. UI interaction also takes up a lot of the runtime in the video, while the conversion from puzzle to blueprint appears nearly instant. Running 1000 cases of just creating a solution file could be a very quick process, a matter of seconds.
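To make that concrete, here is a rough sketch of what the batch step could look like. The interesting part is hidden inside solve_puzzle, a placeholder for whatever method a bot author actually invents; the file naming and folder layout are assumptions on my part:

import zipfile
from pathlib import Path

PUZZLE_DIR = Path("test_puzzles")    # wherever the drop was extracted
SOLUTION_DIR = Path("solutions")     # staging area for generated files

def solve_puzzle(puzzle_path: Path) -> bytes:
    """Placeholder: read one puzzle file and return the bytes of a solution
    file for it. This is the part each bot author has to invent, presumably
    with the help of om.py's parsing and simulation."""
    raise NotImplementedError

def solve_all() -> None:
    """Generate one solution file per puzzle and bundle them for submission."""
    SOLUTION_DIR.mkdir(exist_ok=True)
    for puzzle in sorted(PUZZLE_DIR.rglob("*.puzzle")):
        solution_bytes = solve_puzzle(puzzle)
        # Assumption: one solution file per puzzle, named after the puzzle.
        (SOLUTION_DIR / f"{puzzle.stem}.solution").write_bytes(solution_bytes)
    # Bundle everything into the zip that gets emailed to panic.
    with zipfile.ZipFile("submission.zip", "w") as archive:
        for solution in sorted(SOLUTION_DIR.glob("*.solution")):
            archive.write(solution, arcname=solution.name)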
But it’s clearly not that exciting to just clone gtw123’s bot, extract the solution method, and complete the 24 hour challenge. It might be a good idea for one person to do this, ideally gtw123. The goal, though, is to come up with new methods: for the minds of the community to automate solving the puzzles more optimally for the main metrics.
The Alternative
What if people solved all the puzzles? A team of humans, devoted to making sure that humanity would come out on top? This is the goal of Zorflax and “Team Nobots”. It would take a team effort to solve 1000 puzzles in 24 hours, optimally or not. Crowdsourcing is hard because it takes a decent amount of skill to create solutions. So if each person contributes 8 hours, how many skilled players would it take?
For humans, we get bottlenecked by UI interaction again. Look at the current any% world record for speedrunning the campaign, by rebix:
This represents the fastest reasonable solution creation speed for a human. In the case of a level with significant complexity, it takes between 30 seconds and 1 minute to build and run a solution:
That’s something like 12 hours of UI interaction for 1000 puzzles. If humans are to be building separate solutions to minimize Cost, Cycles, or Area, then it’s functionally 3000 puzzles. And likely more than 36 hours, because of the types of solutions expected.
When rebix and others route speedsolves, the actual machines are not optimized for any of the major metrics. The only consideration is ease of building and memorizing. When trying to optimize any metric, the UI interaction time becomes far higher. Cost- and Area-optimized solutions have hundreds of instructions, especially if they require purifying metals by several levels. I would feel comfortable putting a lower bound of 100 hours of UI interaction on making the full set of solutions. This means over a dozen players putting in a full day, assuming that nobody has to think too hard.
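For what it’s worth, here is the back-of-envelope arithmetic behind those estimates, with the inputs being the rough figures above rather than measurements (the 3x optimization factor in particular is my own assumption):

# Back-of-envelope capacity planning for Team Nobots.
puzzles = 1000
metrics = 3                  # Cost, Cycles, Area, each wanting its own solution
seconds_per_build = 45       # midpoint of the 30 s to 1 min speedsolve figure
optimization_factor = 3      # assumption: optimized machines take far longer to place
hours_per_person = 8

build_hours = puzzles * metrics * seconds_per_build / 3600   # ~37.5 h
optimized_hours = build_hours * optimization_factor          # ~112.5 h
players_needed = optimized_hours / hours_per_person          # ~14 players

print(f"{build_hours:.0f} h unoptimized, {optimized_hours:.0f} h optimized, "
      f"about {players_needed:.0f} players at {hours_per_person} h each")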
What About Thinking?
There have been speedsolving competitions in the past, where a player must solve puzzles quickly with no outside aid. The timer starts when the players first have the ability to see and download the puzzle, and ends when they submit their solution to the event’s website. All design, building, and programming is on the clock.
The fastest solutions usually come in at around 3-4 minutes for simple puzzles and 5-10 for complicated ones. Here’s my 8-minute solution to “Berlo’s Dualism”, which had the task of replacing a central salt with vitae and mors.
Reducing thinking meant creating waste polymers even though it wasn’t strictly necessary, building two mirror-image halves with click-drag copying, and mass-offsetting their instructions by 2 cycles to reuse complex pieces already built.
Gimmicky puzzles could take 15 minutes or more, as was the case for “Galvanization”, a puzzle where the only permitted instructions were grab, rotate, and drop. No pivots, no pistons, no track. I built the below in around 17 minutes.
Reasonable Puzzle Design
Fortunately, for the 24 hour challenge, there is a guarantee that puzzles will permit all instructions and arm types. Bonding and unbonding will always be available, and there will be no gimmick bonds like triplex between non-fire atoms.
But still, the puzzles have some complexity and the goal involves optimizing, so it may take an hour per puzzle to get all 3 metrics to a comfortably low score. This is starting to look more like 100 dedicated solvers, which is bordering on unreasonable in such a small community.
But all they need to beat is the bots, right? Who is to say the bots aren’t all going to be making horribly unoptimized solutions that even the most naive speedsolver could beat on their first effort? Well, that’s what we are getting out of this experiment: figuring out reasonable expectations on both sides, then testing those expectations against reality.
A Hybrid Approach
We have a modding framework for the game. A modded UI can allow for a much nicer solving experience. A macro that expands to “the instruction sequence for purifying lead to gold using a single arm on 3 track” would save a lot of effort. There’s a fast-forward button that is more effective than alt-clicking. By allowing mods while still developing legitimate solutions, perhaps the efficiency of each human on Team Nobots could exceed expectations.
There is a bot, GrimBOT, made by Grimmy, that can identify unused parts in a solution and automatically remove them. It knows how to downgrade multibonders to regular bonders, and multi-arms to regular arms. Here’s a recent minimum-cycles solution for Explorer’s Salve in the journal that was actually submitted by rebix and later improved by GrimBOT. Can you spot the 10g of unused parts?
If the rushed engineering of Team Nobots ends up being plagued by unnecessary extra parts, a pass by GrimBOT might improve their standing. But it may also be against the principle of Team Nobots, and that is a call for Zorflax to make!
Alternately, there are hybrid strategies that are very bot-heavy too. An autosolver could not only solve puzzles but also flag some of them as “most likely to be improved with human effort”, leaving the author a shortlist of focus puzzles to help move the needle on their score. The author would then manually crank out better solutions for those puzzles, leading to a hybridized submission with better performance than either human or machine alone.
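A minimal sketch of that triage step might look like this, assuming the bot can already report its own best score for each puzzle and metric; the “expected” baseline is a stand-in for whatever heuristic the author trusts (atom counts, molecule sizes, gut feeling):

# Sketch of the "hand these to a human" triage, assuming the bot already
# produced a score for each (puzzle, metric) pair and we have some heuristic
# guess at what a decent hand-made solution would achieve.

def shortlist(bot_scores: dict[tuple[str, str], int],
              expected_scores: dict[tuple[str, str], int],
              limit: int = 20) -> list[tuple[str, str]]:
    """Return the (puzzle, metric) pairs where the bot looks weakest,
    i.e. where its score most exceeds the heuristic expectation."""
    def badness(key: tuple[str, str]) -> float:
        return bot_scores[key] / max(expected_scores[key], 1)
    ranked = sorted(bot_scores, key=badness, reverse=True)
    return ranked[:limit]

# Hypothetical usage: hand the 20 worst-looking cases to a human solver.
# focus = shortlist(bot_scores, expected_scores)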
The Results
When trying to crown a winner for this contest, we need some sort of aggregate scoring. The scores are applied to the zip file of solutions, so if (for example) rebix participates in Team Nobots but also creates a bot, the two could be scored independently so long as solutions are zipped separately.
The method panic came up with is to award 1 point for every combination of puzzle and metric, and up to 1 additional point based on optimization. The aggregate score is the sum, over 3000 puzzle-and-metric cases, of 1 + best/this.
For example, say the puzzle GEN001 has, among its many submissions, one with 65 cost, one with 42 cycles, and one with 18 area, and that these are the best in their respective metrics. Say your only solution takes 100 cost, 100 cycles, and 100 area. Then you get 1.65 points for cost, 1.42 points for cycles, and 1.18 points for area. In effect, this means that every puzzle solved is worth a minimum of 3 points, so solving everything totals at least 3000 points without ever optimizing. The maximum score is 6000, which would correspond to winning every metric on every puzzle. To do this, one would theoretically need a process, bot or otherwise, that creates 3 separate solutions for every puzzle (which is permitted! Your zip of solutions can have multiple solutions to the same puzzle, and each one is evaluated).
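In code, the whole tally is only a few lines. This sketch assumes the per-puzzle, per-metric results have already been pulled out of the solution runs into dictionaries keyed by puzzle name and metric:

METRICS = ("cost", "cycles", "area")

def aggregate_score(your_scores: dict[tuple[str, str], int],
                    best_scores: dict[tuple[str, str], int]) -> float:
    """Sum 1 + best/this over every (puzzle, metric) pair you solved.

    your_scores maps (puzzle, metric) to your best submitted value for that
    metric; best_scores maps the same keys to the best value anyone achieved.
    Unsolved puzzles simply contribute nothing.
    """
    return sum(1 + best_scores[key] / value
               for key, value in your_scores.items())

# The GEN001 example from above: bests of 65/42/18 against a 100/100/100 solution.
best = {("GEN001", m): v for m, v in zip(METRICS, (65, 42, 18))}
mine = {("GEN001", m): 100 for m in METRICS}
print(aggregate_score(mine, best))  # 1.65 + 1.42 + 1.18 = 4.25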
Communication
It is unlikely that every generated solution to every generated puzzle will get screen time in any meaningful way, as this could end up being an intractable amount of data. Probably the first thing revealed at the deadline will be a spreadsheet tallying points. If people have been keeping their methods secret, there may also be a new flood of discussion once the deadline passes. Both of these are going to happen on Discord, and possibly also Reddit.
Really, this is all about “rushed engineering” in Opus Magnum, so as the June 2nd date grows closer we will start to get a glimpse of how that plays out. Maybe by October 20th, there will be some well-established bots and it will resemble Sebastian Lague’s tiny chess AI competition! In that case, we could see a follow-up results video giving the highlights.
It’s all up to how people approach the challenge. So good luck!