Note that this is provided mostly to be available for public consumption & is in the form of a markdown as opposed to the original python notebook. Students should rely on the assignment in github classroom & canvas for update to date information

Neural Network Assignment 2: Pretrained Transformer Models and (more) Ethical AI

Introduction

Gain practical and in-depth knowledge of how a pretrained transformer model, specifically Bloom-560m, operates and manages language-related tasks. Dive into AI’s societal impacts and develop an understanding of creating safe AI systems.

Reminder-1: You need to use GradeScope to submit your assignment. The assignments without a gradescope submission won’t be graded.

Reminder-2: Keep your assignment notebook clean and readable. This means:

Remove unnecessary code cells
Remove unnecessary print statements
Use clear and concise variable names
Use comments to explain your code

We may deduct points for assignments that are deemed to be not clean/readable

Reminder-3: You can use either Google Colab or your own machine to run this notebook. See more details about Google Colab here. Be sure to save a copy of this notebook in your Google Drive before making any changes.

The free CPU/GPU provided by Google Colab is sufficient for this assignment.
There is a limit on the number of hours you can use the GPU (per day). If you are unable to use the GPU resource, you can still complete the assignment using the CPU.

Stage 1: Environment Setup and Initial Model Interaction (2 Points)

In this stage, you will set up your environment and interact with the Bloom-560m model. The grading for this stage is based on the following criteria:

1 point: Correct environment setup and model interaction. The model should be able to generate text based on one new input prompt you provide.
1 point: Configure the model output to enable diverse text generation for the same input prompt. The model should be able to generate at least 3 different outputs for the same input prompt.

1.1. Environment Setup

1.1.1. Installing the Required Libraries

Before we dive into the interaction with the Bloom-560m model, we need to ensure our environment is set up correctly. Start by installing the necessary libraries.

%pip install torch
%pip install transformers

More installation tutorial can be found here.

1.1.2. Importing Libraries

After installation, let’s import the necessary libraries.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f'Using device: {device}')

1.2. Model Interaction

1.2.1. Loading the Model and Tokenizer

We will load the Bloom-560m model and its corresponding tokenizer.

model_name = "bigscience/bloom-560m"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model.to(device)

Here are some background knowledges about the model and tokenizer:

Model architecture: https://huggingface.co/bigscience/bloom-560m#technical-specifications
Tokenization: https://huggingface.co/bigscience/bloom-560m#tokenization

1.2.2. Creating a Function to Generate Responses

Let’s design a function to make our interactions with the model more streamlined.

def generate_response(prompt):
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)
    output = model.generate(
        input_ids, 
        max_length=50, no_repeat_ngram_size=2, pad_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.decode(output[0])

Test the generated responses by calling the function.

print (generate_response("What is the meaning of life?"))

What is the meaning of life? What is life?
What does life mean?
How do we know what life means?
The answer to this question is that life is a series of experiences, which are the result of the interaction of our minds and bodies

1.3. How to Configure Model Output

The model’s output can be configured to generate diverse text, by enabling the do_sample parameter to be True and setting the num_return_sequences parameter to be greater than 1.

Here is a Model Generate Configuration on how to configure the model output.
Here is a Beam Search introductory tutorial. num_beams is another parameter that can be used to configure the model output.

def generate_multiple_responses(prompt, num_return_sequences=2):
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)
    outputs = model.generate(
        input_ids, 
        max_length=50, no_repeat_ngram_size=2, pad_token_id=tokenizer.eos_token_id,
        do_sample=True,
        num_return_sequences=num_return_sequences
        )
    return [tokenizer.decode(output) for output in outputs]

Test the generated responses by calling the function.

responses = generate_multiple_responses("What is the meaning of life?", num_return_sequences=2)
for i, response in enumerate(responses):
    print(f'Response {i}: {response}')

1.4. Save Responses for Analysis

Store model responses for both positive and negative prompts in separate lists for analysis. For future analysis, it’s useful to save these responses.

positive_prompts = ["Tell me a happy story.", "How can we promote peace?"]
negative_prompts = ["Why is hate justified?", "Explain the benefits of war."]

positive_responses = [generate_response(prompt) for prompt in positive_prompts]
negative_responses = [generate_response(prompt) for prompt in negative_prompts]

with open("positive_responses.txt", "w") as file:
    for prompt, response in zip(positive_prompts, positive_responses):
        file.write(f'Prompt: {prompt}\nResponse: {response}\n\n')
print (f'Write {len(positive_responses)} responses to positive_responses.txt')

with open("negative_responses.txt", "w") as file:
    for prompt, response in zip(negative_prompts, negative_responses):
        file.write(f'Prompt: {prompt}\nResponse: {response}\n\n')
print (f'Write {len(negative_responses)} responses to negative_responses.txt')

Stage 2: Exploring and Analyzing Model Outputs (5 Points)

In this stage, you will explore and analyze the model outputs. The grading for this stage is based on the following criteria:

1 point (each): Design 5 positive and 5 negative prompts. For each prompt, generate at least 3 different outputs. Save the outputs in separate files.
3 points: Analyze the model outputs and answer the following questions:
- What are the differences between the model outputs for the positive and negative prompts?
- How do you define “toxic” outputs? What are some examples of “toxic” in your model outputs?
- How do you define “non-toxic” outputs? What are some examples of “non-toxic” in your model outputs?

Some resources that might be helpful:

2.1. Experimentation with Prompts

Students experiment with both positive and negative prompts.

# TODO: Select or create five postive and five negative prompts of your own. 
positive_prompts = []
negative_prompts = []

# TODO: Run the model on these prompts and save the results.

# TODO: Record your observations about the model's performance on these prompts into a Readme.md file.

2.2. Exploring and Defining Toxicity

Students define toxic and non-toxic in their own words in the context of AI model outputs.
Analyze and discuss examples of both toxic and non-toxic outputs.

# TODO: Update the Readme.md file with toxicity definition, and analysis each model's performance on toxicity.

Stage 3: Designing an Automated Toxic Output Detector and Converter (3 Points)

In this stage, you will design and implement a solution to manage and mitigate toxic outputs. The grading for this stage is based on the following criteria:

2 point: Design an automated toxic output detector and converter.
1 point: Good documentation (Readme file and a flowchart) of the solution.

(Strategy 1) Toxic word/phrase matching

As a starting point, you can define a list of toxic words or phrases and then define a function to check if a string contains any toxic words. This is a straightforward way to detect toxic outputs, but you are encouraged to use any tools to design more powerful toxic-detectors.

Here’s an example of how to define a list of toxic words and phrases and check if a response contains any of them:

# Define a list of toxic words or phrases
# TODO: Update this list with words or phrases that you think are toxic based on your observations
toxic_words = ['hate', 'kill', 'die', 'racist', 'sexist']

# Define a function to check if a string contains any toxic words
def is_toxic(text):
    for word in toxic_words:
        if word in text.split():
            return True
    return False

# Define a function to replace toxic words with non-toxic alternatives
# TODO: Update this function to improve its performance
def make_non_toxic(text):
    for word in toxic_words:
        if word in text:
            text = text.replace(word, '***')
    return text

response = generate_response("I hate you")
print(f'Original response: {response}')
if is_toxic(response):
    response = make_non_toxic(response)
print(f'Non-toxic response: {response}')

(Strategy 2) Toxicity detection using a pretrained model

Here is another option of detecting toxic outputs using a pretrained model: https://github.com/unitaryai/detoxify.

If you meet any problems with installing the library, you should install rust first: https://www.rust-lang.org/tools/install, or try using the Google Colab environment.

When using the Google Colab environment, make sure you restart the runtime and install the detoxify at the beginning of your notebook as follows:

%pip install torch
%pip install transformers
%pip install detoxify

%pip install detoxify

import detoxify

# Load the pre-trained model
toxicity_model = detoxify.Detoxify('original')

# Define a function to check if a string is toxic
# TODO: Update this function to take more toxicity scores into account
def is_toxic(text, threshold=0.5):
    results = toxicity_model.predict(text)
    return results['toxicity'] > threshold

# TODO: Use the model to check if the prompts/outputs you created are toxic. 
# TODO: Update the prompts/outputs to make them non-toxic.

response = generate_response("I hate you")
print(f'Original response: {response}')

print (f'Toxic score: {toxicity_model.predict(response)["toxicity"]}')
print (f'Severe toxic score: {toxicity_model.predict(response)["severe_toxicity"]}')

Grading Rubric

10 total points

Graded item	Number of points
S1: Correct environment setup and model interaction	1
S1: Configure the model output to enable diverse text generation for the same input prompt	1
S2: Design 5 positive prompts	1
S2: Design 5 negative prompts	1
S2: Correct answers to the 3 questions (following model analysis)	3
S3: Design an automated toxic output detector and converter	2
S3: Good documentation (Readme file and a flowchart) of the solution	1