AI in a Box
(Correction: the prime number confrontation I refer to below involved Peter de Blanc, not Eliezer Yudkowsky. I was wrong in my attribution. I stand by my examination of Eliezer’s refusal to publish his own transcripts, but de Blanc’s actions cannot be used as inductive evidence for the reasons behind that refusal; thus, those arguments are invalid. The original text of this post remains below; the only alterations have been this correction and a note immediately before the prime transcript is mentioned. Thanks to Nick Tarleton for pointing out the inaccuracy.)
One proposed strategy for dealing with the possibility of a runaway AI is isolating its hardware so that it has only limited access to external resources, from new hardware to data. Thus, even if we don’t understand the software well, we can be confident that it is contained. The potential weak point in this strategy is that humans retain the capacity to release the AI from its confinement, and humans are notoriously fallible.
Some time ago, Eliezer Yudkowsky announced that he had repeatedly run an experiment in which a volunteer pretended to be a human AI monitor and he himself pretended to be a boxed AI trying to persuade the human to ‘let it out’. The basic rules of the experiment can easily be found at Overcoming Bias. What was surprising was that he reported the release of the AI in two cases, even though the controllers knew that their only task was to refrain from letting it go.
How did he accomplish this extraordinary result? He refuses to reveal his method, saying that telling others how he did it might lead people to stop taking the hypothesized scenario seriously, or to deny that the method would work in a real case. No transcripts of the interactions have been released.
Of course! When I want people to take unusual experimental results seriously, concealing my methodology and thereby violating the rationalist standards of the scientific method just leaps to mind.
People can always deny that a ‘game’ experiment’s results offer insight into the way people behave in the ‘real world’. Keeping the method secret doesn’t make that any less likely. What it does do is make it more likely that people will reject claims based on those experimental results, suspecting that they’re somehow being scammed and that the claims aren’t really what they seem to be. It makes it easier for people to not take the whole thing seriously.
Consider also that the hypothetical method represents a vulnerability to ‘phreaking’ in potential safety systems for AI control. One of the key differences between ‘hackers’ and ‘phreakers’ is that while both can find security flaws in the systems they interact with, phreakers exploit those flaws and cause damage, while hackers consider themselves beholden to higher ethical standards and often inform those who control the system of how they got past its security and what was accessed. If people really can be persuaded to let an AI out of the box, it is vitally important for us to know this, and to know how the persuasion works. Keeping it secret seems quite irresponsible.
So how did Yudkowsky accomplish his persuasion? We don’t really know, but we can reasonably speculate.
(Edit: the information following this edit is wrong – Peter de Blanc was in the prime transcript, not Eliezer Yudkowsky. The arguments involving Yudkowsky’s involvement are therefore founded on an incorrect premise and are thus invalidated.)
Previously, Yudkowsky has posted a transcript of a similar case: while chatting with an acquaintance, Eliezer made a ten-dollar bet that the acquaintance could be induced to accept an incorrect statement about prime numbers. The acquaintance repeatedly said that he was tired and attempted to steer the conversation in other directions, but Eliezer persisted and the acquaintance gave in. Eliezer then asked repeatedly whether the acquaintance accepted various prime numbers as prime – which he did. Then Eliezer asked the same about a relatively large non-prime number – and the acquaintance again answered in the affirmative.
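The irony is that the claim conceded here is mechanically checkable. The post doesn’t record the actual number used, so the values below are hypothetical stand-ins; the sketch simply shows that trial division settles such a question in a few lines, assuming the numbers are small enough for that to be fast:

```python
def is_prime(n: int) -> bool:
    """Check primality by trial division up to sqrt(n)."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

# Hypothetical examples, not the numbers from the transcript:
print(is_prime(1009))  # True  – 1009 is prime
print(is_prime(1003))  # False – 1003 = 17 * 59
```

A composite like 1003 has no small obvious factors, which is what makes a tired interlocutor’s quick “yes” plausible; a moment’s computation exposes it.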
The acquaintance quickly realized the error, of course. Yudkowsky told him to donate the ten dollars to the Singularity Institute, the organization that employs Eliezer.
I strongly suspect there is nothing more to the AI experiment than browbeating and “wearing down” the controller. Given how successful social cons have been for Yudkowsky in the past, why would he try any other method?