Hey, Computer, Make Me a Font
This is the story of my journey learning to build generative ML models from scratch and teaching a computer to create fonts in the process. Yes, real TrueType fonts, with a capital-only set of glyphs. The model takes a font description as an input and produces a font file as an output. I named the project ‘FontoGen’.
Here are a few examples of fonts generated by the FontoGen model:
bold, sans
THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG?
italic, serif
THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG!
techno, sci-fi, extrabold
THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG.
If you want to generate your very own font, head to the GitHub project, clone it, and don’t forget to leave a star. Then download the weights from Hugging Face, and follow the instructions here. And if you want to learn the full story, keep reading.
I’m gonna go build my own theme park
— Bender Bending Rodríguez
Intro
At the beginning of 2023, when AI started making ripples across the internet, like many others I became very interested in the field. I was sucked into the world of making memes with Stable Diffusion, training LoRAs on my friends’ faces, and fine-tuning text-to-speech models to mimic famous voices.
At some point, I started looking at text-to-SVG generation which, as it turned out, is a much harder task compared to raster-based text-to-image generation. Not only is the format itself quite complex, it also allows for representing the very same shape in many different ways. As I was interested in learning how to build a generative ML model from scratch, this became my weekend project.
The Idea
As I started exploring other ways to generate SVGs, I got here throughout the IconShop2 paper which achieved fairly spectacular outcomes. It took me a while to breed them by constructing a mannequin based mostly on the outline within the paper. After lastly attaining close-enough outcomes, I realised that the method of producing fonts might be just like the method of producing SVGs, and began engaged on the challenge.
In comparison with SVG photos, fonts are each simpler and more durable to generate. The simpler half is that fonts don’t have the color part current in vibrant SVG photos. Nonetheless, the more durable half is {that a} single font consists of many glyphs, and all glyphs in a font should preserve stylistic consistency. Sustaining consistency turned out to be a big problem which I am going to describe in additional element beneath.
The Model Architecture
Inspired by the SVG generation approach described in the IconShop paper, the model is a sequence-to-sequence model trained on sequences that consist of text embeddings followed by font embeddings.
Text Embeddings
To produce text embeddings, I used a pre-trained BERT encoder model, which helps to capture the “meaning” of the prompt. The text sequence is limited to 16 tokens, which in BERT’s case roughly corresponds to the same number of words. While the text prompt could potentially be longer, memory constraints were a significant concern for my single-GPU setup. So, all textual font descriptions present in the dataset were summarised into a set of a few keywords with the help of OpenAI’s GPT-3.
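As a minimal sketch of this step, here is how such prompt embeddings can be produced with Hugging Face’s transformers; the specific checkpoint name is an assumption, while the 16-token limit matches the constraint described above.

```python
import torch
from transformers import BertModel, BertTokenizer

# Assumed checkpoint for illustration; any pre-trained BERT encoder would do.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

def embed_prompt(prompt: str, max_tokens: int = 16) -> torch.Tensor:
    # Truncate/pad the description to the 16-token budget described above.
    tokens = tokenizer(
        prompt,
        max_length=max_tokens,
        truncation=True,
        padding="max_length",
        return_tensors="pt",
    )
    with torch.no_grad():
        out = bert(**tokens)
    # One embedding per prompt token; these are later prepended to the font sequence.
    return out.last_hidden_state  # shape: (1, 16, 768)

text_embeddings = embed_prompt("bold, sans")
```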
Font Embeddings
In order to produce font embeddings, the fonts first need to be converted into a sequence of tokens, similar to how text is tokenised with the BERT tokeniser. In this project, I only considered the glyph shapes and ignored the width, height, offset, and other useful metadata present in the font files. Every glyph was downsampled to 150×150 and normalised. I found that the 150×150 dimension preserves font features with minimal glyph deformation, which was more pronounced at lower resolutions.
I used Python’s fonttools to parse the font files, which can conveniently process each glyph as a sequence of curve, line, and move commands, where each command can be followed by zero or more points. I decided to limit the glyph set to the following glyphs to get a minimal usable font (a sketch of this extraction step follows the glyph list below).
ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.,!?
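The sketch below shows roughly what this extraction and quantisation looks like with fonttools’ RecordingPen; normalising against unitsPerEm and the exact rounding are assumptions for illustration, not the exact FontoGen code.

```python
from fontTools.ttLib import TTFont
from fontTools.pens.recordingPen import RecordingPen

GLYPHS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.,!?"
GRID = 150  # the 150x150 quantisation grid described above

def glyph_to_commands(path: str, char: str):
    """Extract a glyph's outline as (command, points) pairs and snap every
    point onto the 150x150 grid. A rough sketch under stated assumptions."""
    font = TTFont(path)
    upm = font["head"].unitsPerEm
    glyph_name = font.getBestCmap()[ord(char)]
    pen = RecordingPen()
    font.getGlyphSet()[glyph_name].draw(pen)
    commands = []
    for op, pts in pen.value:  # e.g. ("qCurveTo", ((x1, y1), (x2, y2)))
        points = [p for p in pts if p is not None]  # qCurveTo may end with None
        quantised = [
            (round(x / upm * (GRID - 1)), round(y / upm * (GRID - 1)))
            for x, y in points
        ]
        commands.append((op, quantised))
    return commands
```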
The final model vocabulary needed to represent 22,547 different tokens:
- 40 glyphs,
- 5 path operations: moveTo, lineTo, qCurveTo, curveTo, closePath,
- 2 tokens to represent EOS (end of sequence) and PAD (padding),
- 150² = 22,500 different points.
The token sequence is then converted into an embedding vector using learnable embedding matrices. Additionally, as proposed in the SkexGen¹ paper, separate matrices were used specifically for the x and y coordinates. And the final step was to apply positional embeddings.
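A minimal sketch of what such an embedding step could look like is below; the layer names, token counts, and the way command and coordinate embeddings are combined are assumptions rather than the exact FontoGen scheme.

```python
import torch
import torch.nn as nn

class FontEmbedding(nn.Module):
    """Sketch: learnable embeddings with separate matrices for x and y
    coordinates (as in SkexGen) plus learned positional embeddings."""

    def __init__(self, n_symbol_tokens: int = 47, grid: int = 150,
                 d_model: int = 512, max_len: int = 8192):
        super().__init__()
        self.symbol_emb = nn.Embedding(n_symbol_tokens, d_model)  # glyphs, ops, EOS, PAD
        self.x_emb = nn.Embedding(grid, d_model)   # separate matrix for x coordinates
        self.y_emb = nn.Embedding(grid, d_model)   # separate matrix for y coordinates
        self.pos_emb = nn.Embedding(max_len, d_model)

    def forward(self, symbols, xs, ys):
        # symbols, xs, ys: (batch, seq_len) integer tensors
        positions = torch.arange(symbols.size(1), device=symbols.device)
        h = self.symbol_emb(symbols) + self.x_emb(xs) + self.y_emb(ys)
        return h + self.pos_emb(positions)
```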
Transformer
The model is an autoregressive encoder-only transformer consisting of 16 layers and 8 attention heads. The model’s dimension is 512, resulting in a total of 73.7 million parameters.
  | Name  | Type     | Params
-----------------------------------
0 | model | Fontogen | 73.7 M
-----------------------------------
73.7 M    Trainable params
0         Non-trainable params
73.7 M    Total params
294.728   Total estimated model params size (MB)
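For reference, the hyperparameters above can be collected into a small config; the field names below are illustrative, not taken from the FontoGen codebase.

```python
from dataclasses import dataclass

@dataclass
class FontogenConfig:
    # Values from the description above; names are illustrative.
    d_model: int = 512
    n_layers: int = 16
    n_heads: int = 8
    vocab_size: int = 22547
    max_text_tokens: int = 16
```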
I computed the loss using simple cross-entropy and ignored the padding token.
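In PyTorch terms this amounts to something like the sketch below, where pad_token_id is a placeholder for wherever the padding token lands in the vocabulary.

```python
import torch.nn.functional as F

def sequence_loss(logits, targets, pad_token_id: int):
    # logits: (batch, seq_len, vocab_size), targets: (batch, seq_len)
    # Plain cross-entropy over the token vocabulary; padding positions
    # are excluded from the loss via ignore_index.
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=pad_token_id,
    )
```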
Attention
Every time a part of a glyph is generated, several factors influence the decision on which token comes next. First, the model prompt affects the glyph’s shape. Next, the model needs to consider all previously generated tokens for that glyph. Finally, it needs to pay attention to all other glyphs generated so far to ensure consistency in style.
When doing initial experiments with only a handful of glyphs, I started with full attention. However, as the sequence length increased, this approach became impractical, prompting a shift to sparse attention. After exploring various options, I settled on BigBird³ attention. This approach supports both global attention, to focus on the initial prompt, and window attention, which observes the N previous tokens, capturing the style of a number of preceding glyphs.
Given that a single glyph can have a variable number of tokens, I set the attention mechanism to consider at least the three preceding glyphs. While most of the time the approach has been successful at preserving the overall font style, in some complex cases the style would slowly drift into an unrecoverable mess.
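The sketch below builds a simplified mask of this kind: global attention on the prompt tokens plus a causal sliding window. It omits BigBird’s random blocks, and the window size is an arbitrary placeholder rather than the value used in FontoGen.

```python
import torch

def sparse_attention_mask(seq_len: int, prompt_len: int = 16,
                          window: int = 512) -> torch.Tensor:
    """Return a (seq_len, seq_len) boolean mask where True means
    'position i may attend to position j'."""
    idx = torch.arange(seq_len)
    causal = idx[None, :] <= idx[:, None]               # autoregressive: no peeking ahead
    is_global = idx[None, :] < prompt_len               # every position sees the prompt
    in_window = (idx[:, None] - idx[None, :]) < window  # plus a window of recent tokens
    return causal & (is_global | in_window)
```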
Training
To train the model, I assembled a dataset of 71k distinct fonts. 60% of all fonts only had a vague category assigned to them, while 20% of fonts were accompanied by longer descriptions, so the descriptions were condensed into a few keywords using GPT-3.5. Additionally, I included 15% of fonts where the prompt only contained the font’s name, and the remaining 5% of the dataset had an empty textual description assigned to them, to ensure that the model is capable of producing fonts with no prompt at all.
Due to large memory requirements, my Nvidia 4090 with 24G of VRAM could only fit two font sequences in a single batch, and I would often observe gradient explosions. Using gradient accumulation and gradient clipping helped to resolve the issue. The model was trained for 50 epochs, which took 127 hours. I restarted training once after 36 epochs, and kept training for another 14 epochs with reduced gradient accumulation. The training was stopped when the validation loss showed very little improvement.
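A minimal sketch of the accumulation and clipping pattern is shown below, assuming `model`, `optimizer`, `dataloader`, and a `compute_loss` helper already exist; the accumulation factor and clipping threshold are illustrative, not the values used in training.

```python
import torch

accumulation_steps = 16   # illustrative; compensates for the tiny per-step batch
max_grad_norm = 1.0       # illustrative clipping threshold

optimizer.zero_grad()
for step, batch in enumerate(dataloader):
    # Scale the loss so gradients average over the virtual batch.
    loss = compute_loss(model, batch) / accumulation_steps
    loss.backward()
    if (step + 1) % accumulation_steps == 0:
        # Clip to tame the gradient explosions mentioned above.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        optimizer.step()
        optimizer.zero_grad()
```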
Chasing Performance
Achieving good training performance was essential since I was training on a single GPU, and training took a significant amount of time.
- In the initial iteration, I processed font files and textual descriptions directly within the model on every step. While this codebase structure streamlined prototyping, it meant that the same work had to be repeated over and over, making the training process slower. Additionally, having BERT loaded in memory meant that it would take up precious VRAM. By moving as much as possible to the dataset preprocessing stage, I achieved a threefold performance boost (see the sketch after this list).
- Originally, the model relied on Hugging Face’s transformers. Migrating the code to xformers⁴ gave a very visible improvement in speed and memory usage.
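The sketch below illustrates the preprocessing idea from the first bullet: run BERT once over every font description offline and cache the result, so the training loop only loads tensors and BERT never occupies VRAM during training. The file layout and the `embed_prompt` helper (from the earlier sketch) are assumptions.

```python
import torch

def preprocess_dataset(descriptions: dict[str, str], out_path: str) -> None:
    # Embed every description once, up front, and store the tensors on disk.
    cache = {
        font_path: embed_prompt(text).cpu()
        for font_path, text in descriptions.items()
    }
    torch.save(cache, out_path)

# During training, the dataset simply calls torch.load(out_path) and indexes
# into the cached embeddings instead of running BERT on every step.
```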
Instead of a Conclusion
I achieved what I set out to do: I learned how to build a generative transformer model, and built a project that is capable of generating fonts as a side effect. But there are so many things that I still haven’t tried. For example, what if the model could be integrated into existing font editors so that the font designer only creates a single glyph A, and all other glyphs are generated by the model? Or maybe the font editor could suggest the control points for bézier curves as they’re being drawn! The horizon is vast, and there is a lot left to explore.
If you’ve read this article and you think that I’ve overlooked something obvious, there’s a good chance I did! I’m always keen to learn more, so please reach out and let me know what you think.
Thanks to
- Paul Tune for answering many questions I had about building transformer models.
References
1. SkexGen: Autoregressive Generation of CAD Construction Sequences with Disentangled Codebooks
2. IconShop: Text-Guided Vector Icon Synthesis with Autoregressive Transformers
3. Big Bird: Transformers for Longer Sequences
4. xFormers: A modular and hackable Transformer modelling library