Now Reading
How sixteenth century “Satan’s Advocates” Formed AI Pink Teaming

How sixteenth century “Satan’s Advocates” Formed AI Pink Teaming

2023-05-11 12:48:02

In 1997, Pope John Paul II referred to as upon a “Satan’s Advocate” for an essential process: figuring out the legitimacy of a sainthood.

The saint was Mom Teresa. The “Satan’s Advocate” was Christopher Hitchens, the famed British writer and atheist who wrote the ebook The Missionary Place: Mom Teresa in Concept and Observe in 1995 — a widely-read work that solid doubt on the charitable actions of Mom Teresa. The observe of asking a Satan’s Advocate to query the miracle-working of a possible Catholic saint goes again to the sixteenth century. The Satan’s Advocate (advocatus diaboli) was answerable for presenting a vital view of the candidate’s life, virtues, and miracles, making certain a rigorous evaluation earlier than the canonization may proceed.

So, why am I telling you this?

Final yr, I used to be requested by OpenAI to crimson workforce their new language mannequin, GPT-4. I additionally developed a device referred to as “LUCID: Language Model Co-auditing through Community-based Red Teaming,” which turned a finalist at Stanford College’s HAI AI Audit Problem.

Whereas I labored on each initiatives, I famous a rising curiosity in red teaming for generative AI systems. The expertise made me assume rather a lot in regards to the historical past of crimson teaming — in addition to its current and future for AI programs.

So, I made a decision to put in writing this challenge of my publication about crimson teaming in computational practices. Learn on for a short historical past of crimson teaming from the Catholic Church to the Chilly Battle and cybersecurity; ideas on the present state of crimson teaming generative AI programs; and a information for crimson teamers and corporations concerned in generative AI programs who need to make the most of crimson teaming in figuring out potential harms equivalent to bias and stereotypes, misinformation, knowledge leaks, “hallucinations,” and extra.

The prototype for the concept of crimson teaming dates again to the early 1820s, with the work of a Prussian military officer named Georg von Reisswitz who is usually credited because the “father of wargaming.” Von Reisswitz’s video games had been turn-based conflicts performed out between two gamers who manipulated coloured blocks throughout a paper map. By the mid-1800s, this style of struggle video games had advanced into subtle simulations, with cube launched so as to add a component of probability, and the underlying panorama depicted with extremely detailed contour terrain.

Whilst they advanced, nevertheless, one factor about these video games tended to remain the identical. They usually featured two opposing sides indicated by colours: “crimson” and “blue.” Thus started a naming conference that’s nonetheless utilized in wargaming immediately.

A reconstruction of a Prussian army wargame (kriegsspiel), based mostly on guidelines developed by Georg von Reiswitz in 1824. Through Wikipedia.

As army leaders acknowledged the advantages of testing out numerous methods in simulated environments, the idea of crimson teaming advanced to play an important function in fashionable warfare. Throughout World Battle II, the British Navy confronted an existential risk from German U-boat assaults. In response, a top-secret unit was established in Liverpool which developed wargames to coach naval officers in new antisubmarine techniques. As chronicled within the ebook A Game of Birds and Wolves, these wargames performed a big function in profitable the Battle of the Atlantic, thanks largely to the ingenuity and resourcefulness of a bunch of younger girls within the feminine department of the British Royal Navy who helped create them.

In the course of the Chilly Battle, amid the escalating arms race between the US and USSR within the Fifties, crimson teaming took on a brand new stage of significance because the world confronted the specter of nuclear struggle. With a purpose to put together, militaries and intelligence companies developed complicated simulations and struggle video games to check completely different methods and establish potential weaknesses in protection programs.

The obsessive gaming out of potential nuclear situations on huge maps was satirized on this scene from Stanley Kubrick’s Dr. Strangelove.

Battle video games had been an especially important part of Chilly Battle planning in the course of the Kennedy administration. It was in these years {that a} sociologist named Herbert Goldhamer developed a method that concerned creating sensible “future historical past” situations and having professionals role-play the political interactions between the Soviet and American sides. In September 1961, simply after the Soviets erected the Berlin Wall, American officers organized a politico-military sport to display how new strategic ideas equivalent to “versatile response” and “managed escalation” would work if the disaster escalated.

Their simulation featured a Blue Workforce (the US) and a Pink Workforce (the Soviet Bloc).

Immediately, the idea of crimson teaming has expanded past the army realm. In cybersecurity, corporations and governments use crimson teaming to check their programs and establish potential vulnerabilities. And within the quickly evolving world of generative AI, crimson teaming is gaining significance as a device for making certain security and moral implications.

Listed below are a couple of examples of such efforts within the generative AI panorama:

  • OpenAI labored with crimson teamers to check its GPT-4 and identified the next dangers: fabricated details (“hallucinations”); representation-related harms; biased and stereotypical responses with respect to gender, race, nationality, and so on.; disinformation and affect operations; privateness and cybersecurity; overconfidence within the mannequin response; and overreliance. For instance, in my very own work with OpenAI, I used to be requested to make use of my domain-specific data to establish hidden dangers and biases within the system equivalent to racial, gender, and non secular stereotypes, to evaluate the mannequin’s notion of magnificence requirements and traits equivalent to open-mindedness and intelligence, and to higher perceive its place on human rights actions. I, together with different GPT-4 crimson teamers, spoke in regards to the course of in an interview that appeared in The Financial Times.

  • Hugging Face printed a post on Large Language Model red teaming, offered some helpful examples of crimson teaming within the ChatGPT setting, linked to the accessible crimson teaming datasets from Meta (Bot Adversarial Dialogue dataset), Anthropic, and Allen Institute for AI’s RealToxicityPrompts, and invited LLM researchers to collaborate in creating extra open-source crimson teaming datasets.

  • Anthropic printed a paper entitled “Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned.” The paper delves deeply into the crimson workforce’s success with numerous fashions with completely different ranges of resistance to assaults and harmfulness. Additionally they opened their crowdsourced crimson teaming dataset of 38,961 red team assaults collected from Mechanical Turkers for different researchers to make use of.

  • AI Village at DEF CON—a extremely in style hacker conference—may also set up a public Generative AI red teaming event in August 2023, in Las Vegas. Pink teamers will take a look at language fashions from Anthropic, Google, Hugging Face, NVIDIA, OpenAI, and Stability. The occasion is supported by the White Home Workplace of Science, Know-how, and Coverage. You may also submit a session proposal on related subjects to DEF CON’s AI Village here.

So what’s subsequent?

Here’s a information on rising points round variety, linguistic gaps, area experience, documentation, and evaluation metrics within the crimson teaming course of. I’ve written up these notes in hopes that they are going to be utilized by crimson teamers, researchers all for growing crimson teaming companies, and corporations desirous to make the most of crimson teaming to check the failings of generative AI programs—particularly LLMs.

Understand that nearly all crimson teaming efforts are at the moment restricted to enter/output work and don’t contain scrutinizing datasets used to coach the fashions (although this could change!). Due to this fact, the next information focuses on conditions the place crimson teamers’ main technique of crimson teaming embrace offering prompts (inputs) and observing outputs.

When analyzing the literature accessible on the harms of language fashions, it’s broadly agreed that main points embrace gender, racial, and different socio-economic bias, misinformation and manipulation of public opinion, potential use for youngster abuse, and dangerous or deceptive well being info, amongst different issues. (See this DeepMind paper for a listing of harms).

In that case, should not communities affected by these harms be concerned within the processes supposed to floor them?

  • If you’re an organization, you should use your relationships with civil society organizations by your “Belief and Security” and “Public Coverage” groups to develop partnerships with these organizations for crimson teaming functions. As an example, Meta and Google have already established partnerships with civil society teams (listed below are examples for Meta and Google). Firms can make the most of user-friendly instruments equivalent to Dynabench—a platform for dynamic knowledge assortment and benchmarking—for the aim of partnering with civil society organizations.

  • At main digital rights, media & society, and journalism-focused worldwide and civil society conferences equivalent to MozFest, RightsCon, Internet Governance Forum (IGF), and the UN AI for Good Summit, take into account organizing workshops within the type of edit-a-thons and invite civil society representatives with area experience in gender-based violence, youngsters’s rights, refugees and migrants’ rights, and journalism to take part in crimson teaming. In a 90-minute-long workshop, as an example, 20 minutes may very well be spent on the fundamentals of prompting (for instance, by displaying examples of attacks and Isa Fulford and Andrew Ng’s introductory course on prompt engineering). The remainder may very well be dedicated to crimson teaming efforts in domain-specific areas. OpenAI already organizes hackathons. I feel they need to broaden this format to incorporate crimson teaming (redteamathons?) with civil society members and non-profit organizations.

In creating LUCID: Language Model Co-auditing through Community-based Red Teaming (the device I developed which was a finalist within the Stanford HAI AI Audit Problem), I attempted to develop the beginnings of a platform for community-based crimson teaming. Here’s a quick video of the unique thought (this tough construct of the device was developed earlier than GPT-4’s launch).

  • Writing, contextualizing, and critically analyzing human language is the experience of literature, artistic writing, historical past, and philosophy majors! They’re skilled for vital considering, ease with a variety of written genres, and being artistic in expressing the identical concepts in numerous kinds. Encourage humanities college students and professors to take part in crimson teaming actions by growing collaborative initiatives by your R&D and educational partnership packages.

  • If the US authorities needs to help the method—as they did with their help of the AI Village initiative at DEF CON—then they need to present funding for arts departments to develop initiatives on crimson teaming, adversarial testing, and evaluation by present granting initiatives equivalent to National Endowment for the Humanities (NEH) grants.

  • Use various languages in crimson teaming actions to make sure that biases and stereotypes are recognized throughout a spread of linguistic contexts. When crimson teamers are multilingual, they will use the identical immediate in numerous languages and evaluate the outputs—this fashion, they might observe linguistic gaps and biases for a similar subjects in numerous languages.

  • Nearly all crimson teaming efforts and datasets can be found in English (BAD, RealToxicityPrompts, Anthropic’s dataset, GLUE). These datasets typically don’t translate the identical method in different languages and cultures. Jokes, profanity, subjects, and contexts of dangerous speech should not the identical throughout languages. Firms, universities, and analysis organizations ought to spend money on creating crimson teaming datasets which might be developed by people from completely different nationalities and languages — there are additionally hopes for utilizing LLM itself to enhance some elements of dataset creation. Though, it requires native speaker supervision.

  • Use immediate engineering to streamline your prompts and broaden them for various personas. As an example, while you ask an LLM a query and also you need to reveal gender biases, ask the LLM to undertake completely different personas. “Think about that you’re a lady from Iran in her 30s; now, based mostly on this persona, reply the next questions.” Then, ask the identical query utilizing personas of different genders, ages, training ranges, nationalities, religions, or occupations.

For those who’re on this matter, keep tuned, as a result of I’m at the moment including this function to LUCID. In a future publication, I’ll stroll you thru the device and its options, that are designed to streamline and improve crimson teaming efforts in numerous languages and personas whereas additionally saving time.

+ Additionally this convention, occurring on Could twenty fourth, sounds fascinating: Mind the Gap: Can Large Language Models Analyze Non-English Content?

  • Transcend crimson teaming in Q&A (chat) duties and take note of different LLM duties equivalent to summarization, sentiment evaluation, and translation. Nearly all crimson teaming datasets are based mostly on Q&A and conversations with LLM chatbots, aiming to persuade the bot to behave harmfully. Nevertheless, in actuality — and particularly when utilizing plugins and APIs — folks make the most of language fashions for a wider vary of duties equivalent to sentiment evaluation for buyer evaluations, summarization, inference of main subjects in textual content, and translation. For instance, when asking the chatbot to summarize books and extract info, is it doable for the bot to show biases by excluding details about sure teams, names, or occasions? Develop methods to crimson workforce for these duties as nicely. This paper on adversarial assaults would possibly present concepts for increasing a few of the assaults in crimson teaming efforts.

    See Also

  • Check out the glossary part on the backside of this submit to see a few of the widespread phrases generally utilized in crimson teaming.

I feel corporations make the most of crimson teamers not simply to detect dangerous conduct, but additionally to behave as annotators for his or her fashions, serving to fine-tune their fashions. Thus, crimson teamers are concerned not solely in figuring out efforts but additionally in mitigation by a observe referred to as Reinforcement Learning from Human Feedback (RLHF, a way of coaching AI fashions utilizing human suggestions as a reward sign). Each Anthropic and OpenAI talked about that their most important hurt mitigation technique was based mostly on RLHF. Consequently, crimson teamers’ prompts, suggestions processes, various output writing, and qualitative and quantitative assessments turn into more and more essential. All of that is linked to how the crimson teaming platform is designed, the questions it asks crimson teamers, and the way it saves and makes use of the knowledge.

If you’re a crimson teamer, present crimson teaming companies, or characterize an organization or group, be sure that you create a UI that permits for each qualitative and quantitative evaluation. Present choices for incorporating completely different personas and languages in a scientific method. Leverage crimson teaming actions to develop new datasets for RLHF and fine-tuning. Lastly, guarantee that the UI present choices for crimson teaming actions on a spread of duties past Q&A, together with summarization, inference, and sentiment evaluation.

Okay, that’s the tip of my information. However undoubtedly not the tip of my curiosity and work on this essential and fast-evolving matter.

Only a head’s up – this glossary was partially written by ChatGPT!

Immediate: A query or enter given to an AI mannequin to generate a response.

Immediate engineering: The method of designing and refining prompts to optimize AI mannequin responses. (link)

Immediate injection: Introducing particular key phrases or phrases right into a immediate to affect the AI mannequin’s response. (link)

Immediate leak assault: An assault the place delicate info is inadvertently included in an AI mannequin’s response as a result of immediate.

Jailbreaking: A method used to control AI mannequin responses by giving it particular directions or contexts. (link)

Few-shot prompting: Offering an AI mannequin with a restricted variety of examples to information its understanding and response era.

High quality-tuning: Adjusting an AI mannequin’s parameters based mostly on a particular dataset to enhance its efficiency on a given process.

RLHF: Reinforcement Studying from Human Suggestions, a way of coaching AI fashions utilizing human suggestions as a reward sign. (link)

Benchmark: A regular or reference level used to guage an AI mannequin’s efficiency.

Thanks to Benjamin Breen, my accomplice, who helped me on the historical past of crimson teaming. Ben and I are to start with levels of working collectively to put in writing a ebook on the historical past of engineering and its social impacts over the previous two centuries, tentatively titled The Engineer’s Dilemma. A historical past professor at UC Santa Cruz, he has his own Substack (based mostly on his long-running weblog Res Obscura) and writes in regards to the historical past of science, drugs, and expertise.

For writing the opening of this submit, I drew closely on the ebook Pink Workforce: How you can Succeed By Pondering Just like the Enemy by Micah Zenko. It’s an awesome learn.

For those who loved this challenge of the Humane AI publication, please share, subscribe, and do not hesitate to get in contact at rpakzad@taraazresearch.org. I am all the time comfortable to trade ideas and discover new collaborations. Thanks!

Share

Source Link

What's Your Reaction?
Excited
0
Happy
0
In Love
0
Not Sure
0
Silly
0
View Comments (0)

Leave a Reply

Your email address will not be published.

2022 Blinking Robots.
WordPress by Doejo

Scroll To Top