Creating a fairy tale using an LLM

Tuesday, July 23, 2024


[About the author]
Dimitrios Giakatos

I am working in the IIJ Research Laboratory, focusing on bridging the gap in Internet data knowledge for people from various domains and non-technical users.


In the previous article, we introduced the impact of LLMs in enterprises, from code generation to automating customer support. Despite their complexity, we showed that deploying LLMs has become more accessible, making their capabilities available to a wider audience. In this article, the second in our LLM series, we delve deeper into the practical applications of LLMs, focusing specifically on content generation and fine-tuning. Understanding these aspects is crucial for using LLMs effectively across a variety of tasks. To illustrate this, we will demonstrate how an LLM can be used to generate fairy tales.

Using LLMs for content generation

LLMs can power a wide range of text applications: content generation, summarization, translation, classification, chatbot support, and much more.

Content generation is the primary reason for the popularity of LLMs. The ability of a machine to produce content so convincingly human-like that it can confuse even the most discerning reader is truly fascinating. This ability is important in today’s digital world, where anyone can create content and make it public within seconds using social media platforms. For enterprises, social media is a crucial tool for publicity, but generating text ready for publication can be time-consuming. This is where LLMs come in, providing an efficient solution to produce content quickly.

What do we mean by LLM fine-tuning?

Previously, we discussed content generation; however, we did not say anything about the accuracy of the generated content. When we use LLMs, we expect them to create anything we want, but the reality is different. The ability to generate relevant content depends on the quantity and quality of the material the LLM has been trained on. LLMs are typically trained on vast and diverse datasets, so if we apply an LLM to a particular domain, the accuracy of the generated content may not be satisfactory.

To address this issue, we can fine-tune the LLM. Fine-tuning lets us combine the model’s existing knowledge, which keeps it from making grammar, syntax, and vocabulary mistakes, with the new domain-specific knowledge we provide. In this way, the model can generate domain-specific content without such errors.

For example, let’s consider customer support, as discussed in our previous article. To create an LLM that automates the reply process, we would start with a pre-trained model for content generation and then fine-tune it on previous replies. This enables the model to learn both what content to generate and what format to use, based on past interactions.

And so our fairy tale begins…

Fine-tuning an LLM is not as straightforward as running one. It requires a good understanding of the model’s architecture as well as basic knowledge of Machine Learning (ML). As we already mentioned, this blog series aims to provide a basic overview of LLMs and their usage rather than delving into detailed methodologies such as text preprocessing, parameter selection, or learning rates. We will treat many aspects as black boxes, but for those interested in digging deeper into fine-tuning, we recommend the book “Build a Large Language Model (From Scratch)”.

Now, let’s delve into generating original fairy tales.

First, we need to install the necessary Python libraries. Please note that we assume our operating system is macOS. If you’re using a different operating system, refer to the official PyTorch documentation for installation guidance.

pip install transformers torch torchvision torchaudio accelerate datasets requests
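Since the fine-tuning step later in this article relies on Apple’s Metal Performance Shaders (MPS) backend, it can be worth checking up front that PyTorch actually sees it. This is an optional sanity check, not part of the pipeline:

import torch

# Should print True on Apple silicon with a recent PyTorch build; if it prints
# False, adjust the Trainer arguments later on (use_mps_device) accordingly
print(torch.backends.mps.is_available())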

Once the libraries are installed, we’re ready to develop our fine-tuned model. We’ll turn to Project Gutenberg, a digital library of works whose copyrights have expired, to find suitable material for our fairy tale generation model. For this article, we’ve chosen “Grimms’ Fairy Tales” by Jacob Grimm and Wilhelm Grimm. It’s crucial to select material consisting of fairy tales so that the model learns to generate relevant content.

import requests

# Download "Grimms' Fairy Tales" (Project Gutenberg ebook #2591) and save it locally
response = requests.get("https://www.gutenberg.org/cache/epub/2591/pg2591.txt")
with open("pg2591.txt", "w") as f:
    f.write(response.text)

After downloading the book, we proceed with text preprocessing to ensure the text is formatted appropriately for our task.

storiesTitles = [
    "THE GOLDEN BIRD", "HANS IN LUCK", "JORINDA AND JORINDEL", "THE TRAVELLING MUSICIANS", "OLD SULTAN", "THE STRAW, THE COAL, AND THE BEAN",
    "BRIAR ROSE", "THE DOG AND THE SPARROW", "THE TWELVE DANCING PRINCESSES", "THE FISHERMAN AND HIS WIFE", "THE WILLOW-WREN AND THE BEAR",
    "THE FROG-PRINCE", "CAT AND MOUSE IN PARTNERSHIP", "THE GOOSE-GIRL", "THE ADVENTURES OF CHANTICLEER AND PARTLET", "RAPUNZEL",
    "FUNDEVOGEL", "THE VALIANT LITTLE TAILOR", "HANSEL AND GRETEL", "THE MOUSE, THE BIRD, AND THE SAUSAGE", "MOTHER HOLLE",
    "LITTLE RED-CAP [LITTLE RED RIDING HOOD]", "THE ROBBER BRIDEGROOM", "TOM THUMB", "RUMPELSTILTSKIN", "CLEVER GRETEL", "THE OLD MAN AND HIS GRANDSON",
    "THE LITTLE PEASANT", "FREDERICK AND CATHERINE", "SWEETHEART ROLAND", "SNOWDROP", "THE PINK", "CLEVER ELSIE", "THE MISER IN THE BUSH",
    "ASHPUTTEL", "THE WHITE SNAKE", "THE WOLF AND THE SEVEN LITTLE KIDS", "THE QUEEN BEE", "THE ELVES AND THE SHOEMAKER", "THE JUNIPER-TREE",
    "THE TURNIP", "CLEVER HANS", "THE THREE LANGUAGES", "THE FOX AND THE CAT", "THE FOUR CLEVER BROTHERS", "LILY AND THE LION",
    "THE FOX AND THE HORSE", "THE BLUE LIGHT", "THE RAVEN", "THE GOLDEN GOOSE", "THE WATER OF LIFE", "THE TWELVE HUNTSMEN", "THE KING OF THE GOLDEN MOUNTAIN",
    "DOCTOR KNOWALL", "THE SEVEN RAVENS", "THE WEDDING OF MRS FOX", "THE SALAD", "THE STORY OF THE YOUTH WHO WENT FORTH TO LEARN WHAT FEAR WAS",
    "KING GRISLY-BEARD", "IRON HANS", "CAT-SKIN", "SNOW-WHITE AND ROSE-RED"
]
stories = {title: [] for title in storiesTitles}

# Walk through the book line by line, assigning each line to the story whose
# title most recently appeared; stop at the Project Gutenberg footer
title = None
with open("pg2591.txt", "r") as f:
    for line in f:
        text = line.strip()
        if text == "*** END OF THE PROJECT GUTENBERG EBOOK GRIMMS' FAIRY TALES ***":
            break
        if text in stories:
            title = text
            continue
        if title:
            stories[title].append(text)

# Join each story's lines into a single string and drop any empty entries
dataset = {"text": []}
for story in stories:
    dataset["text"].append(" ".join(filter(lambda x: x != "", stories[story])))
dataset["text"] = list(filter(lambda x: x != "", dataset["text"]))

Once preprocessing is complete, we set a couple of environment variables to avoid potential errors during fine-tuning.

import os

# Cache directory for models and tokenizers downloaded from the Hugging Face Hub
download_dir = './huggingface/transformers'
os.environ["HF_HOME"] = download_dir
# Disable the MPS memory high-watermark limit so training does not abort under memory pressure
os.environ["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = "0.0"

Next, we download a pre-trained tokenizer to convert the text of “Grimms’ Fairy Tales” into numerical form. A tokenizer turns human-readable text into sequences of token IDs, which is the only kind of input an LLM understands. We’ve chosen the tokenizer used by GPT-2.

from transformers import AutoTokenizer
from datasets import Dataset

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
# GPT-2 has no padding token, so we reuse the end-of-text token for padding
tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
  return tokenizer(examples["text"], padding="max_length", truncation=True)

# Wrap our dictionary in a Hugging Face Dataset and tokenize every story
dataset = Dataset.from_dict(dataset)
tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=["text"])
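To make “numerical” concrete, we can tokenize a short phrase by hand. This snippet is purely illustrative (the sample text is arbitrary) and not part of the pipeline:

# The tokenizer maps text to a list of integer token IDs, which is what the model consumes
print(tokenizer("Once upon a time")["input_ids"])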

The resulting numerical representation of the text is enormous. To make it manageable for fine-tuning, we concatenate the tokenized stories and split them into fixed-length blocks of 128 tokens each.

block_size = 128

def group_texts(examples):
  # Concatenate all tokenized stories into one long sequence per field
  concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
  total_length = len(concatenated_examples[list(examples.keys())[0]])
  # Drop the remainder so the sequence splits evenly into blocks
  total_length = (total_length // block_size) * block_size
  # Cut each sequence into blocks of block_size tokens
  result = {
    k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
    for k, t in concatenated_examples.items()
  }
  # For causal language modelling, the labels are the input IDs themselves
  result["labels"] = result["input_ids"].copy()
  return result

lm_datasets = tokenized_datasets.map(
  group_texts,
  batched=True,
  batch_size=1000,
  num_proc=4,
)
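As an optional sanity check, we can confirm the reshaping worked: every example should now be a block of exactly block_size token IDs.

# Number of blocks, and the length of the first block (should equal block_size)
print(len(lm_datasets), len(lm_datasets[0]["input_ids"]))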

Now, we download the pre-trained LLM, ensuring it’s compatible with the tokenizer for optimal results. We’ll be using GPT-2 for this purpose.

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")

With the model downloaded, we proceed to fine-tune it. This process will take some time.

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
  "test_trainer",
  evaluation_strategy="no",  # we have no separate evaluation split, so skip evaluation
  learning_rate=2e-5,
  weight_decay=0.01,
  num_train_epochs=200,
  use_mps_device=True        # train on the Apple silicon GPU via the MPS backend
)

trainer = Trainer(
  model=model,
  args=training_args,
  train_dataset=lm_datasets,
)

trainer.train()

Once fine-tuning is complete, we save the model for future use.

trainer.save_model("my-model")
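The exact files written to the my-model directory depend on your transformers version, but you should at least see a configuration file and the model weights. A quick way to confirm the save succeeded:

import os

# List the saved files; expect something like config.json plus a weights file
print(sorted(os.listdir("my-model")))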

Now, we’re close to achieving our goal of generating original fairy tales. We load our fine-tuned model and start using it for tale generation.

from transformers import AutoTokenizer, AutoModelForCausalLM
import textwrap

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
model = AutoModelForCausalLM.from_pretrained("./my-model")

# Give the model an opening line and let it continue the tale
text = "Once upon a time, there was a king"
inputs = tokenizer(text, return_tensors="pt").input_ids
outputs = model.generate(inputs, max_new_tokens=110)
generated_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Wrap the generated text to 80 characters per line for readability
print("\n".join(textwrap.wrap(" ".join(generated_text), 80)))

That’s it! We have our first fairy tale authored not by a human but by an LLM!

Once upon a time, there was a king who had two beautiful daughters. They
slept in two beds all in one room; and when they went to bed, the doors were
shut and locked up; but every morning their shoes were found to be quite worn
through as if they had been danced in all night; and yet nobody could find out
how it happened, or where they had been. Then the king made it known to all the
land, that if any person could discover the secret, and find out where it was
that the princesses danced in the night, he should be king after his death.

What’s next?

In this article, we explained how to fine-tune an LLM to generate fairy tales (the same methodology can be applied to other topics, such as poetry). However, it’s worth noting that the content generated by this LLM may not be highly accurate or coherent. This can be addressed by hyperparameter tuning, using a larger corpus of fairy tales, or employing a more advanced LLM such as Llama 3. Although fine-tuning such models takes longer, the results will be more accurate. Additionally, if we aim to build an LLM that generates content in a specific domain, we could also consider training one from scratch; however, this requires a significant amount of computational resources and time.
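As a small illustration of one cheap lever, here is a sketch that adjusts the generation settings rather than the fine-tuning itself: sampling from the model instead of decoding greedily usually produces more varied tales. It assumes the model, tokenizer, inputs, and textwrap from the generation step above are still in scope, and the parameter values are illustrative rather than tuned:

# Sampling-based generation with the fine-tuned model from above
outputs = model.generate(
  inputs,
  max_new_tokens=110,
  do_sample=True,    # sample from the distribution instead of greedy decoding
  temperature=0.8,   # <1.0 sharpens the distribution, >1.0 flattens it
  top_p=0.95,        # nucleus sampling: keep only the most likely tokens covering 95% of probability
)
print("\n".join(textwrap.wrap(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0], 80)))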

In our next article, which will be the last in this series, we will explore LLMs from an industry perspective. We will also introduce a simple example of how to classify text using an LLM. Stay tuned!
