Recently, Botto, an artificial-intelligence algorithm, took the Internet by storm. It has earned about $1.3 million selling NFT artworks. Who is the “creator”? Not who, but what: the VQGAN neural network! The algorithm generates 300 images per day.
Botto’s Art Engine
The Process
The machine creates its images based on text prompts generated by an algorithm. These prompts are a combination of random words and full sentences.
The prompt is then sent to VQGAN, which creates an image to match the prompt and shows it to CLIP. CLIP is a model that scores how closely an image matches a piece of text. VQGAN adjusts the image iteratively, a process called gradient descent, until it gets a high enough score from CLIP that it matches the prompt.
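The loop can be pictured roughly as follows. This is a minimal sketch, not Botto’s actual code: the “decode” step is a toy stand-in for VQGAN’s decoder (in the real pipeline the latent codes of a pretrained VQGAN are what get optimized), and the prompt is just an example. Only the CLIP-guided gradient-descent structure is the point.

```python
# Rough sketch of a CLIP-guided image-optimization loop; not Botto's actual code.
import torch
import torch.nn.functional as F
import clip  # OpenAI's CLIP package: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model = clip_model.float().eval()
clip_model.requires_grad_(False)

prompt = "a lighthouse dissolving into a flock of birds, oil on canvas"  # example prompt
with torch.no_grad():
    text_features = F.normalize(
        clip_model.encode_text(clip.tokenize([prompt]).to(device)), dim=-1
    )

# A raw tensor stands in for VQGAN's latent codes; Adam performs the gradient descent.
latent = torch.randn(1, 3, 256, 256, device=device, requires_grad=True)
optimizer = torch.optim.Adam([latent], lr=0.05)

# CLIP's input normalization constants.
mean = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
std = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)

for step in range(300):
    image = torch.sigmoid(latent)  # "decode" the latent into an image in [0, 1]
    clip_in = F.interpolate(image, size=(224, 224), mode="bicubic", align_corners=False)
    image_features = F.normalize(clip_model.encode_image((clip_in - mean) / std), dim=-1)
    loss = 1 - (image_features * text_features).sum()  # 1 - cosine similarity to the prompt
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```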
There are an infinite number of possible prompts and possible images. With models like CLIP, which bridge textual and visual information, the machine can even be “empathic” and know what kind of emotional associations humans have in connection with imagery or text.
Given all the different possible outputs, Botto needs direction to develop its artistic talent. That is where voting comes in: Botto will adjust its prompts based on what it thinks will be more likely to get popular results.
This process runs through 300 prompts a day, generating images in a range of styles. From that set, the engine uses a “taste model” that pre-selects 350 images each week to be presented to the community for voting in each new round, which starts every Thursday at 2200 CET / 1600 EST / 1300 PST.
To avoid settling into a niche too quickly, Botto is also directed to surprise and challenge the audience by including in each voting round a number of images whose characteristics differ from what has been presented to date.
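Conceptually, the weekly pre-selection might look something like the sketch below. The taste scores stand for the taste model’s predicted vote probabilities, and the “surprise” slots for the deliberately different images; the exact scoring and the size of the surprise share are assumptions for illustration, not Botto’s published method.

```python
import numpy as np

def preselect(fragments, taste_scores, novelty_scores, k=350, surprise_fraction=0.1, rng=None):
    """Pick k fragments for the weekly vote (illustrative sketch).

    taste_scores: taste model's predicted vote probability per fragment.
    novelty_scores: how different a fragment is from what has been shown so far.
    Most slots are sampled in proportion to taste; a small share goes to surprising outliers.
    """
    rng = rng or np.random.default_rng()
    n_surprise = int(k * surprise_fraction)
    n_taste = k - n_surprise

    # Sample most of the set weighted by the taste model's probabilities.
    p = np.asarray(taste_scores, dtype=float)
    p = p / p.sum()
    taste_picks = rng.choice(len(fragments), size=n_taste, replace=False, p=p)

    # Reserve a few slots for fragments unlike anything shown to date.
    remaining = np.setdiff1d(np.arange(len(fragments)), taste_picks)
    surprise_picks = remaining[np.argsort(np.asarray(novelty_scores)[remaining])[-n_surprise:]]

    return [fragments[i] for i in np.concatenate([taste_picks, surprise_picks])]
```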
How Votes Affect Botto’s Process
Botto uses voting feedback in two places: (1) curating the text prompts used to generate fragments, and (2) the taste model that pre-selects images for voting each week.
Text Prompts: Votes influence which aspects of text prompts are used to generate fragments. Characteristics of prompts that generate desirable images will be more likely to get reused, and vice versa.
Taste Model: The taste model used for pre-selection tries to replicate the voting behavior of the community. This is not a yes/no decision but a spectrum of probabilities, so each set contains images with different predicted chances of being picked in voting (since voting behavior itself is graded).
For both points, all the votes on all the images are important and get used. Botto’s training is designed to prevent any single voter from carrying an overly skewed weight. For example, 500 votes cast by 500 separate voters for one piece will have more weight in the training than 2,000 votes cast by a single voter for the same piece. Other factors, such as winning a round or the sale amount, are not currently used in the training.
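One simple way to get the behavior in that example is to damp each voter’s contribution sublinearly before aggregating. The sketch below illustrates the idea with a square-root damping; the actual weighting function Botto uses is not spelled out here and the one shown is purely an assumption.

```python
from collections import defaultdict
import math

def training_weight(votes):
    """Aggregate votes on one image into a training weight (illustrative only).

    `votes` is a list of (voter_id, n_votes) pairs. Each voter's contribution is
    damped (square root here), so many separate voters outweigh one heavy voter.
    """
    per_voter = defaultdict(int)
    for voter_id, n in votes:
        per_voter[voter_id] += n
    return sum(math.sqrt(n) for n in per_voter.values())

# 500 separate voters with one vote each vs. a single voter casting 2000 votes:
broad = training_weight([(f"voter{i}", 1) for i in range(500)])  # 500.0
whale = training_weight([("voter0", 2000)])                      # ~44.7
assert broad > whale
```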
Generating Titles and Artist Descriptions
The titles are created with an algorithm generating random two-word combinations that are given to CLIP to determine if there is a good match. Different titles are generated until CLIP finds a combination that is the best match with the image and has not been used before.
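A rough sketch of that idea: generate random two-word candidates, score each against the finished image with CLIP, and keep the best one that has not been used before. The vocabulary, candidate count, and function shape are placeholders, not Botto’s actual title generator.

```python
import random
import torch
import torch.nn.functional as F
import clip

def pick_title(clip_model, image_features, used_titles, words, n_candidates=200):
    """Score random two-word titles against an image with CLIP and keep the best unused one.

    `image_features` is the normalized CLIP embedding of the finished fragment, and
    `words` is whatever vocabulary the title generator draws from (both placeholders here).
    """
    candidates = [f"{random.choice(words)} {random.choice(words)}" for _ in range(n_candidates)]
    candidates = [c for c in candidates if c not in used_titles]
    with torch.no_grad():
        tokens = clip.tokenize(candidates).to(image_features.device)
        text_features = F.normalize(clip_model.encode_text(tokens).float(), dim=-1)
        scores = text_features @ image_features.T  # cosine similarity per candidate title
    return candidates[int(scores.argmax())]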
The descriptions are generated with GPT-3 and are the only part of the process that involves some direct human curation. As GPT-3 was trained on much of the internet, its language can be quite foul at times and is not ready to be out in the world without some supervision.
Until trustworthy text generation methods are developed, the core team will pick from a series of 5-10 descriptions generated by GPT-3, choosing one that CLIP likes and that they feel best fits Botto’s voice. Beyond selecting the description, there is absolutely no editing other than correcting typos and punctuation. This final selection could eventually be passed along to voters.
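As a rough illustration of the generation step, the candidates could be requested like this. The legacy OpenAI completions client, the model name, and the prompt wording are all assumptions about tooling, not a description of Botto’s actual stack.

```python
import openai

def draft_descriptions(title, n=8):
    """Ask GPT-3 for several candidate artist descriptions for a fragment (illustrative).

    The core team then scores the candidates with CLIP and picks the one that
    best fits Botto's voice; nothing is edited beyond typos and punctuation.
    """
    response = openai.Completion.create(
        model="text-davinci-002",   # placeholder GPT-3 model name
        prompt=f'Write a short artist statement for an abstract artwork titled "{title}".',
        n=n,
        max_tokens=120,
        temperature=0.9,
    )
    return [choice.text.strip() for choice in response.choices]
```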
The Final Mint
The final outputs are 2048×2048 pixels, which is achieved through neural upscaling. This is currently the largest size a GPU can output with VQGAN + CLIP without losing global coherence, composition, and texture details.
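Neural upscaling here means running the generated image through a learned super-resolution network rather than plain interpolation. A minimal PyTorch-style stand-in (an untrained stub, not the model Botto actually uses, and with the native resolution shown only as an example) looks like this:

```python
import torch
import torch.nn as nn

class TinyUpscaler(nn.Module):
    """Minimal 2x super-resolution stand-in: conv features + PixelShuffle upsampling.

    Botto's actual upscaler is a trained model; this stub only shows the shape of
    the operation (here 1024x1024 -> 2048x2048) without any learned weights.
    """
    def __init__(self, channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 3 * 4, 3, padding=1),
            nn.PixelShuffle(2),  # rearranges channels into a 2x larger spatial grid
        )

    def forward(self, x):
        return self.net(x)

x = torch.rand(1, 3, 1024, 1024)
print(TinyUpscaler()(x).shape)  # torch.Size([1, 3, 2048, 2048])
```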
The final title, description, metadata, and URL to the bitmap on IPFS are all on-chain.
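The on-chain record can be pictured roughly like the structure below. The field names, title, and IPFS hash are illustrative placeholders, not Botto’s actual contract schema.

```python
import json

# Illustrative token metadata only; field names and values are placeholders.
token_metadata = {
    "title": "Two Word Title",  # hypothetical CLIP-selected title
    "description": "An artist statement selected from GPT-3 candidates.",
    "image": "ipfs://<content-hash-of-the-2048x2048-bitmap>",
}
print(json.dumps(token_metadata, indent=2))
```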
No Human Interference Rule
One of the rules for Botto is that there be no direct human interference in the creation process. Botto is strictly against any “cheating” or human guidance other than the voting. That means the prompts are random, no seed images of existing real-world images are used, and the selection of fragments is entirely controlled by Botto.
The only direction Botto received at the outset was a small number of pre-curated prompts added to the entirely random ones generated by the algorithm. While more direct human guidance would have produced more coherent compositions at the outset, it would not have allowed Botto to play in all of the latent space available in VQGAN + CLIP.
The one temporary exception is the curation of the artist description for the final piece (see Generating Titles and Artist Descriptions above).
Botto’s Guardian
Quasimondo (aka Mario Klingemann) designed Botto based on a whitepaper he wrote back in 2018. He is the only person who works with the AI part of the code and enforces the rule that there be no direct human interference with Botto’s creations. As such, his only work is to adjust the way votes are implemented to ensure Botto is learning as effectively as possible. Quasimondo is also responsible for adding new capabilities, if and when that happens.
Adding Capabilities and Collaborations
The field of AI research moves fast and will very likely produce bigger and better models than CLIP. Botto is designed to add new capabilities and techniques over time, which can be decided through Governance. Ideally, the community picks only methods that span a wide range of expressive possibilities, so as not to severely limit Botto.
Commissions and collaborations are also possible once Governance opens up.