TED Talk Generator
In this article, I will walk you through generating your own TED Talk. We will first create a dataset of TED Talk transcripts and then train a character-level LSTM, which we will use to generate our own. For a primer on character-level language models, LSTMs and the inspiration for this article, please check out Andrej Karpathy's amazing blog post, The Unreasonable Effectiveness of Recurrent Neural Networks.
(Prerequisites: Python 2.7, PyTorch, CUDA 9.0, Ubuntu, pandas)
The data and scripts are in the following GitHub repository:
Download all TED Talk Transcripts
All TED Talks to date are regularly cataloged in this file.
I downloaded the file and renamed it as talks.csv. The first column of the file contains the public urls for each talk.
Using pandas, I appended '/transcript.json?language=en' to every url in the first column to get the url of each transcript's json object, and saved these urls to a new csv file - talks_url.csv
The code is in the file json_url.py
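A minimal sketch of what this step might look like (the column layout is an assumption; json_url.py in the repository is the authoritative version):

```python
# Sketch: build the transcript urls from talks.csv
import pandas as pd

talks = pd.read_csv('talks.csv')
url_col = talks.columns[0]  # the first column holds each talk's public url

# Point each url at the json object of the English transcript
talks[url_col] = talks[url_col].astype(str) + '/transcript.json?language=en'

# Write only the urls to a new csv, one per line, with no header or index
talks[[url_col]].to_csv('talks_url.csv', header=False, index=False)
```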
Using wget, I looped through all the links in talks_url.csv and downloaded the json object transcripts from the TED website into a folder TED_json/ (This was a lot faster than using BeautifulSoup, which quickly ran into rate limits.) The bash command is available in script_get.sh
Next, I extracted the transcript text from the json objects. I renamed the first file in TED_json/ from '...language=en' to '...language=en.0' so that it matched the naming format of the remaining files in the folder and would be picked up by the loop in the next script. The actual transcript text is stored under the nested dictionary key 'text'; using pandas, I parsed each json file, extracted the transcript text and stored it as a text file in a new folder TED_transcripts/ to build the final dataset. The code is available in json2txt.py
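As a rough illustration, this Python 3 sketch walks each json object and collects every string stored under a 'text' key (the exact json layout and output file naming are assumptions; json2txt.py is the real version and uses pandas rather than the standard json module):

```python
# Sketch: extract the transcript text from each downloaded json object
import json
import os

def collect_text(node, pieces):
    # Recursively gather every string stored under a 'text' key
    if isinstance(node, dict):
        for key, value in node.items():
            if key == 'text' and isinstance(value, str):
                pieces.append(value)
            else:
                collect_text(value, pieces)
    elif isinstance(node, list):
        for item in node:
            collect_text(item, pieces)

if not os.path.exists('TED_transcripts'):
    os.mkdir('TED_transcripts')

for i, name in enumerate(sorted(os.listdir('TED_json'))):
    with open(os.path.join('TED_json', name)) as f:
        obj = json.load(f)
    pieces = []
    collect_text(obj, pieces)
    with open(os.path.join('TED_transcripts', 'talk_%d.txt' % i), 'w') as out:
        out.write(' '.join(pieces))
```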
* json_url.py: talks.csv -> talks_url.csv
* script_get.sh: talks_url.csv -> TED_json/
* json2txt.py: TED_json/ -> TED_transcripts/
Train a character-level RNN
The collected transcript text files in TED_transcripts/ serve as the training dataset. At the time I downloaded the transcripts, it was a whopping 24.8 MB across 2557 text files (this will grow as more talks are added to the catalog file).
I collated all the text files into one giant string and split it 0.98-0.02 into training and validation sets. I used PyTorch to define a 2-layer LSTM with 128 hidden units (in hindsight, I probably could have used more hidden units).
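A minimal PyTorch sketch of such a network (the class and variable names are my own, not the repository's):

```python
# Sketch: 2-layer character-level LSTM with 128 hidden units
import torch.nn as nn

class CharLSTM(nn.Module):
    def __init__(self, vocab_size, hidden_size=128, num_layers=2):
        super(CharLSTM, self).__init__()
        # Inputs are one-hot characters, so the input size equals the vocabulary size
        self.lstm = nn.LSTM(vocab_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, hidden=None):
        out, hidden = self.lstm(x, hidden)  # out: (batch, seq_len, hidden_size)
        logits = self.fc(out)               # scores over the next character at each position
        return logits, hidden
```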
My batch size was 64 and each training sample was a sequence of 128 characters encoded in one-hot representation. The corresponding target was the same 128-character sequence shifted one character ahead, i.e. the next letter at each position.
During training and validation, I generated the batches by randomly sampling sequences. This is equivalent to shuffling the training and validation sets before each epoch.
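A sketch of how such random batch sampling might look, assuming the corpus has already been mapped to one integer index per character (names like encoded and vocab_size are assumptions, not the repository's):

```python
# Sketch: sample random 128-character windows and one-hot encode them
import numpy as np
import torch

def random_batch(encoded, vocab_size, batch_size=64, seq_len=128):
    # encoded: 1-D numpy array of integer character indices for the whole corpus
    starts = np.random.randint(0, len(encoded) - seq_len - 1, size=batch_size)
    x = np.zeros((batch_size, seq_len, vocab_size), dtype=np.float32)
    y = np.zeros((batch_size, seq_len), dtype=np.int64)
    for i, s in enumerate(starts):
        chunk = encoded[s:s + seq_len + 1]
        x[i, np.arange(seq_len), chunk[:-1]] = 1.0  # one-hot inputs
        y[i] = chunk[1:]                            # targets: the next character at each position
    return torch.from_numpy(x), torch.from_numpy(y)
```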
The final loss function was cross-entropy loss and the network parameters were optimized using Adam.
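In terms of the sketches above, one training step could look like this (a sketch under those assumptions, not the code in trainTED.py):

```python
# Sketch: one optimization step with cross-entropy loss and Adam
import torch.nn as nn
import torch.optim as optim

model = CharLSTM(vocab_size)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters())

def train_step(x, y):
    optimizer.zero_grad()
    logits, _ = model(x)
    # Flatten (batch, seq_len, vocab) -> (batch * seq_len, vocab) for cross-entropy
    loss = criterion(logits.view(-1, logits.size(-1)), y.view(-1))
    loss.backward()
    optimizer.step()
    return loss.item()
```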
The whole network was trained over many hours until I got tired at the 150th epoch, and the adventure concluded with a final validation loss of 1.26.
For the curious - each epoch consisted of ~2900 iterations.
The compressed data folder is TED_transcripts.zip and the training code is available in trainTED.py
Deliver a TED Talk
I saved the final model in winner.pth
The IPython notebook, speech.ipynb, contains the code to generate the talk. I set a speech length of 1000 characters. The input prompt to the network is "The next big invention". The network generates the rest of the speech letter-by-letter.
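A rough sketch of how this letter-by-letter sampling might be implemented, including the temperature parameter discussed below (char2idx, idx2char and vocab_size are assumed to come from the preprocessing step; speech.ipynb contains the actual code):

```python
# Sketch: generate a speech one character at a time from a prompt
import torch
import torch.nn.functional as F

def one_hot(ch, vocab_size, char2idx):
    x = torch.zeros(1, 1, vocab_size)
    x[0, 0, char2idx[ch]] = 1.0
    return x

def generate(model, char2idx, idx2char, vocab_size,
             prime='The next big invention', length=1000, temperature=0.55):
    model.eval()
    chars = list(prime)
    hidden = None
    with torch.no_grad():
        # Warm up the hidden state on the prompt
        for ch in prime[:-1]:
            _, hidden = model(one_hot(ch, vocab_size, char2idx), hidden)
        ch = prime[-1]
        # Sample the remaining characters one by one
        for _ in range(length):
            logits, hidden = model(one_hot(ch, vocab_size, char2idx), hidden)
            # Lower temperature sharpens the distribution; higher temperature adds randomness
            probs = F.softmax(logits[0, -1] / temperature, dim=0)
            ch = idx2char[torch.multinomial(probs, 1).item()]
            chars.append(ch)
    return ''.join(chars)
```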
Using the temperature parameter, we can modulate the output. Setting a high temperature leads to more randomness in the generated text - which contributes to greater variance in the words, at times to a nonsensical extent. Here is a sample snippet generated with a temperature of 0.75:
Yeah, a lot of the words are imaginary (happinarchine, capabilitical) and there is almost no cohesive train of thought.
Let's compare this with a low temperature of 0.2. Here, the network will generate more probable text:
Here we see that the network just repeats itself: no imagined words, but it still doesn't make sense. And the sentences are almost 5 lines long!
Now with a moderate temperature of 0.55, we get something like this:
So a speech about plastic, the planet and...health care? It even experiments with humor in the second line. We see greater variation in sentences and subjects, and fewer mistakes in language.
Of course, this model is not perfect, and with some hyperparameter tuning and a longer training time it could generate a more engaging speech. I'm excited to see what other language modelling advances lie in the future. Thank you for reading!
Thank you to TED Conferences for spreading great ideas.