
Monday, February 18, 2019

Writing historical fiction with AI

As has been reported all over the web, AI can generate words and sentences; the output is not necessarily meaningful or grammatical, but we now have models that understand linguistic structure. We can generate clickbait, semi-scientific papers, and alternative Harry Potter fan fiction. I conducted one of these experiments, and of course it had to involve history. What better way than to re-imagine the lives of dead legends? I accidentally stumbled on Benjamin Franklin's autobiography while perusing Project Gutenberg. It turns out Benjamin was such a fascinating character that I'm surprised we don't have a movie about him yet.

1. Who is Benjamin Franklin?

Benjamin Franklin was a true polymath: inventor, scientist, printer, writer, postmaster, politician and diplomat, all without finishing school. He is known as a Founding Father of the United States without ever being President. His famous achievements include helping draft the US Constitution and the Declaration of Independence, coining the terms for positive and negative charges, his famous kite experiment and publishing Poor Richard's Almanack. I may have missed a few, but you get the point.

2. Natural Language Generation (NLG)

Natural Language Generation with AI is done by creating and training models on word/character sequences and then using the trained model to generate text. This happens in the following steps.
1) Data cleanup: removing stop words, punctuation, words that need not be learnt, etc.

2) Word/Char representation: Words have to be represented as numbers so that mathematical operations can be performed on them. The unit of representation can be a word or a character. There are two kinds of representations: fixed and distributed word representations.


  • Fixed representations are arbitrary representations that make no assumptions about semantics or similarities between words. Examples include dictionary lookup, one-hot encoding and contextual representation. With dictionary lookup, all words are preprocessed and merged into a corpus, and each word is represented by its index in the corpus. This representation is easy to derive, but it can be misleading if the indices are assumed to represent a natural ordering of the words, e.g. if F("house") = 2 and F("table") = 6, it shouldn't be implied that "table" > "house". With one-hot encoding, each word is represented by a vector with zeros at all indices except one. The length of each vector is N+1, where N is the size of the corpus; one is added to account for unknown words, e.g. F("house") = [1,0,0], F("table") = [0,1,0], F("unknown") = [0,0,1]. No ordering is implied since these vectors are orthogonal, but the representations are sparse and use lots of memory. In contextual representation, words are represented in a matrix of frequency terms over sliding windows or sentences. (A short sketch of the first two appears after this list.)




  • In distributed representations, words are represented as dense feature vectors with implied semantics and relationships between them, e.g. F("King") - F("Queen") = F("Man") - F("Woman"). The most popular distributed representations are Word2Vec and GloVe, which are derived from local context and/or word statistics. Although they are powerful representations, out-of-vocabulary words cannot be represented. Detailed explanations can be found in this article.
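To make the fixed representations concrete, here is a minimal sketch in plain Python/NumPy of dictionary lookup and one-hot encoding over a toy three-word corpus (the corpus and names are made up for illustration):

import numpy as np

corpus = ["house", "table", "king"]                      # toy corpus
word_to_index = {w: i for i, w in enumerate(corpus)}     # dictionary lookup: word -> index

def one_hot(word):
    # One-hot vector of length N+1; the extra slot catches unknown words.
    vec = np.zeros(len(corpus) + 1)
    vec[word_to_index.get(word, len(corpus))] = 1
    return vec

print(word_to_index["house"])    # 0 -- just an index, no ordering implied
print(one_hot("table"))          # [0. 1. 0. 0.]
print(one_hot("zeppelin"))       # [0. 0. 0. 1.] -- unknown word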

3) Model training. There are three popular models for Natural Language Generation: RNNs, LSTMs and GRUs. Unlike feed-forward neural networks, these models have loops and are trained with sequential data, with the objective of generating correct sequences.

A: Recurrent Neural Network (RNN)

Recurrent Neural Networks are, in my opinion, the simplest of the three. The network is made of a cell (or cells) that, given input X(t) at time step t and internal state H(t), updates the internal state and produces output Y(t). Due to its sequential nature, the network can be visualized unrolled as shown below.


Source: https://blog.nirida.ai/predicting-e-commerce-consumer-behavior-using-recurrent-neural-networks-36e37f1aed22
Source: https://stats.stackexchange.com/questions/166390/how-to-calculate-weights-for-rnn.
where f(H) is a tanh function and f(O) is a softmax function.
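As a toy illustration of one unrolled step, assuming the usual weight matrices (the names W_xh, W_hh and W_hy below are mine, not taken from the figures):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    # Update the internal state, then emit the output for this time step.
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)    # f(H) is tanh
    y_t = softmax(W_hy @ h_t + b_y)                    # f(O) is softmax
    return h_t, y_t

# Unrolling over a sequence simply feeds h_t back in as h_prev at the next step.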

These networks can be stacked to make Deep RNN as shown below.
Source : https://www.cs.toronto.edu/~tingwuwang/rnn_tutorial.pdf.

Karpathy's blogpost gives a good testimony on how effective they can be.

B: Long Short Term Memory networks (LSTMs)

It is difficult to propagate long-term dependencies with vanilla RNNs. The memory vector h(t) ideally remembers all relevant information, but in practice old inputs are forgotten. As the name suggests, LSTMs are designed to mitigate this problem. Unlike the RNN cell, the LSTM cell is made of four neural networks, and both the output and the cell state are passed from cell to cell. Below is a representation of an LSTM.

Source: https://www.researchgate.net/figure/Structure-of-the-LSTM-cell-and-equations-that-describe-the-gates-of-an-LSTM-cell_fig5_329362532



where X denotes point-wise multiplication.

Essentially the sigmoid layers are gates. The sigmoid function outputs a value between 0 and 1.
Source: http://mathworld.wolfram.com/SigmoidFunction.html

By point-wise multiplying with their respective inputs, they decide how much of a multiplicand is propagated: 0 means propagate nothing and 1 means propagate everything. These gates all take x(t) and h(t-1) as input, but their weights differ. C(t)_bar can be thought of as the current cell state derived from the current input x(t) and the previous output h(t-1). The gates i(t) and f(t) determine how much of C(t)_bar and C(t-1), respectively, is propagated to C(t).
Similar to RNNs, the current output h(t) is a function of the tanh of the current state C(t). The tanh output is gated by the sigmoid gate o(t). Christopher Olah's blogpost also does a good job of deciphering these networks.
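Putting the gates together, one LSTM step can be sketched as follows (NumPy, with my own weight naming: W for input weights, U for recurrent weights, one entry per gate):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b are dicts keyed by gate: 'i', 'f', 'o', plus 'c' for the candidate state.
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])      # input gate
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])      # forget gate
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])      # output gate
    c_bar = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])    # candidate state C(t)_bar
    c_t = f_t * c_prev + i_t * c_bar                            # gated cell state C(t)
    h_t = o_t * np.tanh(c_t)                                    # gated output h(t)
    return h_t, c_t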

At the cost of a larger number of trainable weights, LSTMs solve the problem of long-term dependencies. There are many variants of LSTMs; a popular one is the Gated Recurrent Unit (GRU), described below.

C: Gated Recurrent Units (GRUs)

Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs


The GRU is similar to the RNN because the cell state is stored in one variable h(t), which is also the cell output; it is a variant of the LSTM due to the presence of the sigmoid gates z(t) and r(t). Similar to C(t)_bar above, h(t)_bar can be thought of as the current cell state derived from the current input x(t) and h(t-1). r(t) gates how much of h(t-1) propagates to h(t)_bar; it is also known as the reset gate because it allows the network to forget past information h(t-1) when it is 0. z(t), also known as the update gate, gates how much of h(t)_bar and h(t-1) propagate to h(t), which is both the cell output and the cell state. When z(t) is 0, the output is a function of past information only and the current input is not relevant; likewise, when it is 1, past information is irrelevant.
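The same kind of sketch for one GRU step (again NumPy with my own weight naming), matching the behaviour described above: z(t) = 0 keeps the old state, z(t) = 1 replaces it with h(t)_bar.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W, U, b):
    # W, U, b are dicts keyed by gate: 'z' (update), 'r' (reset), plus 'h' for the candidate.
    z_t = sigmoid(W['z'] @ x_t + U['z'] @ h_prev + b['z'])            # update gate
    r_t = sigmoid(W['r'] @ x_t + U['r'] @ h_prev + b['r'])            # reset gate
    h_bar = np.tanh(W['h'] @ x_t + U['h'] @ (r_t * h_prev) + b['h'])  # candidate h(t)_bar
    h_t = (1 - z_t) * h_prev + z_t * h_bar                            # cell state = cell output
    return h_t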


3. Experiment

My objective was to observe what a model could learn about Benjamin Franklin from the blurbs it generates. A big part of the problem was collecting training data. I decided on Wikipedia for two major reasons: Wikipedia pages are uniform and structured, making data cleaning easy, and I wanted to make use of linked Wikipedia articles to increase context and training data volume; it's easier when these additions also follow the same format. I also added the Autobiography of Benjamin Franklin to the training data, because, why not?

I created a scraper to download, clean and merge all Wikipedia articles related to Benjamin Franklin. The clean function removes the references section, metadata enclosed by '{{}}', as well as non-alphanumeric characters. It can be customized to scrape any other topic; for instance, the command below merges and cleans all articles related to Isaac Newton into a local text file 'newton.txt'.

python wikipedia-scraper.py --title "Isaac Newton" --save_dir "mydata" --combine_file "newton.txt"
You can find my data directory here.
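The actual script lives in the repo linked above; as a rough idea of what the clean step does, a simplified version might look like this (the regexes and function name are illustrative, not copied from the real code):

import re

def clean(raw_text):
    # Illustrative only: drop the references section, the {{...}} metadata
    # and stray non-alphanumeric characters, then normalize whitespace.
    text = re.split(r"==\s*References\s*==", raw_text)[0]
    text = re.sub(r"\{\{[^{}]*\}\}", " ", text)
    text = re.sub(r"[^A-Za-z0-9.,'\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()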

Ashish's repo met all my requirements: it is written in Python, has the three popular models, and has a simple interface for both training and testing, as shown below.
python train.py --epochs 500

4. Results

python sample.py -n 100 --prime "Benjamin loves"
where n is the number of words to generate and prime is the starting text. Considering how long training took, when I ran this command I was expecting magic...

Benjamin loves states patent social relations of the universities of the south of the house of confederation, us based on poets medal of the original system and the edition of ghana , and directors and digest, engineering , christian mysticism , belarus .

[He didn't go to university, but he wrote a pamphlet that resulted in the establishment of the University of Pennsylvania. I highly doubt the other things on the list]

He married occasion of one pursuing about a hidden starting in the 1770s and draw of any year later, but discredited before his spirit the skeptic at the age of thought was that every power of action have posed by the anglican old continental law by debtors promises his report between barton, as holding some of god . the tenth would participate in the tenth episode, encouraged that for meetings on other places of conservation of customary examinations and independently like franklin as endangered to the north of the united states most basic minority reported that the militia could be documented on

[I was expecting something about his wife Deborah; if anything, the first part describes his pursuit to marry Madame Helvetius in France]

James Franklin learning would love the entire spirit being done with intraprofessional students. file merfamily wedgewood moved myself to his grave with a room to me in print for ormolu or banker and actress to manage for his official scholar span , with reverend james howard 1773 1867 1867 , who now

[James, Ben's elder brother, had a print shop and took Ben on as an apprentice. The words "intraprofessional students" and "print" appear to have captured this context]

During elections to the reformed age of defense of the northwest senate were abandoned in full, centered on walton street, cambridge officially joined the abolition of college is transferred by the washington plan , and its principal college had continuously purchased to the great gatsby 1974 , the property of a chevron

In France proved horse in favour of the crown, the most recent artists in the previous committee shall proved many of his mission. ref name affairs there were also environmentalists who had replenished adams until 1781. on time, during the church of france. he established the treaty of paris 1783 treaty of stuff.


The results were far from accurate. Some generations were simply random concatenations from the vocabulary, while others were more positive. The lack of spelling errors indicates the words were learnt correctly, and some sub-sentences make grammatical sense. I would bet on pre-trained word representations such as GloVe and longer training to improve accuracy. As useless as the above model seems, writers may find value in how "imaginative" it can be.




Monday, December 10, 2018

Bringing history to life with GANs

Creating a black-and-white (BW) version of an image is as simple as applying a filter in almost any image editing app available. The same cannot be said about the reverse process. It is possible to apply some filter to a BW image to make it more appealing, or to hire a Photoshop expert to restore an image, but I know of no filter that will intelligently color grass green and water blue.
Color filter app

Well, not yet, but the paper 'Colorization with Generative Adversarial Networks' reported promising results. I was excited to read this paper out of my love for history. History reminds us there is nothing new under the sun, allowing us to set expectations and giving us a sense of security. Although color photography has been available since 1861, it wasn't popular until the 1940s, and up until the 1960s most photos were taken in black and white. Technology cannot time-travel us back to those moments yet, but it can stretch our imagination of what they would have been like. The secret? GANs.

GAN
Invented by Ian Goodfellow, the Generative Adversarial Network (GAN) is composed of two Convolutional Neural Networks (CNNs); see my introduction to CNNs here. The goal of a GAN is to generate a fake dataset that resembles the real dataset. The generator generates output from random input. This output is judged by the discriminator as to whether it comes from the real dataset. The generator is trained to minimize the probability that the discriminator correctly labels its output, and the discriminator is trained to maximize the probability of assigning the correct label. Sort of like good cop, bad cop. After training, the generator is expected to produce meaningful output from purely random input. Magic, right?

A GAN to generate MNIST digits
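The adversarial game above boils down to a short training loop. Below is a generic PyTorch-style sketch of the two objectives (it assumes a discriminator that outputs a probability; it is not the code of any particular paper):

import torch
import torch.nn as nn

bce = nn.BCELoss()

def gan_step(generator, discriminator, real_batch, g_opt, d_opt, noise_dim=100):
    # One adversarial round: the discriminator learns to label real vs. fake,
    # then the generator learns to fool it.
    batch = real_batch.size(0)
    real_labels = torch.ones(batch, 1)
    fake_labels = torch.zeros(batch, 1)

    # Discriminator: maximize the probability of assigning the correct label.
    d_opt.zero_grad()
    fake_batch = generator(torch.randn(batch, noise_dim)).detach()
    d_loss = bce(discriminator(real_batch), real_labels) + \
             bce(discriminator(fake_batch), fake_labels)
    d_loss.backward()
    d_opt.step()

    # Generator: minimize the probability that its output is labelled fake.
    g_opt.zero_grad()
    fake_batch = generator(torch.randn(batch, noise_dim))
    g_loss = bce(discriminator(fake_batch), real_labels)
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()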


Colorizing GAN
The traditional GAN is expected to generate from random noise, but the colorizing GAN generates from BW images. This makes it a conditional GAN where the input is zero noise with the BW image as prior.
The discriminator gets as input a pair consisting of a BW image and either the real colored image or the generated colored image, and labels which pair contains the real image. The generator takes a BW image (L * W * 1) and produces an RGB colored version (L * W * 3). This is achievable with the U-Net, which is made of ConvNets in a contracting and an expanding path.

Generator U-Net
The discriminator architecture is similar to the contracting path: it takes a pair of images concatenated and returns a single output indicating which pair contains the real image.

Discriminator conv-net
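To make the shapes concrete, here is a toy PyTorch sketch of the idea, far smaller than the paper's networks and purely illustrative: the generator maps a 1-channel grayscale image to a 3-channel RGB image with a skip connection back to the input, and the discriminator maps a concatenated (grayscale, color) pair down to a single score.

import torch
import torch.nn as nn

class TinyColorizer(nn.Module):
    # Toy stand-in for the U-Net generator: grayscale in (L*W*1), RGB out (L*W*3).
    def __init__(self):
        super().__init__()
        self.down = nn.Sequential(nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.ReLU())
        self.up = nn.Sequential(nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU())
        self.out = nn.Conv2d(16 + 1, 3, 3, padding=1)   # skip connection with the input

    def forward(self, gray):
        x = self.up(self.down(gray))
        return torch.tanh(self.out(torch.cat([x, gray], dim=1)))

class TinyDiscriminator(nn.Module):
    # Takes the (grayscale, color) pair concatenated into 4 channels, returns one score.
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1))
        self.classify = nn.Linear(64, 1)

    def forward(self, gray, color):
        h = self.features(torch.cat([gray, color], dim=1)).flatten(1)
        return torch.sigmoid(self.classify(h))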


The method
The code and models of the colorizing GAN from the paper 'Image Colorization with Generative Adversarial Networks' are available on GitHub. Considering time and machine constraints, it made economic sense to reuse it. It wasn't easy, because the project wasn't designed for my use case: using a pre-trained model to colorize images the model was never trained or evaluated on.

The program accepts two datasets (cifar10 and places365) and has three modes: training, test evaluation and Turing test. After testing, it samples some images, showing the real RGB, grayscale and generated RGB images side by side.

I downloaded historical BW images from here and added a new dataset and respective model (HistoryBW) to the code. I chose to use the pre-trained model from the places365 dataset because it has the same format as the HistoryBW images (jpg images, unlike the binary files of cifar10), and I placed the places365 model in the historybw checkpoint folder. Since real RGB images are not available and the test modes required them for evaluation, I added a `sample` mode which only samples the input, skipping evaluation and the need for real images. To top it all off, I resized all images to 256 * 256, thanks to IrfanView's batch editing feature. In the end, this command worked.

``python test_sample.py --dataset historybw --sample-size 2``

You can find my fork here, contributions are always welcome!


The outcome
My favorite section.
The caption describes the historical event and evaluates how well the model performed. 'Filter' denotes that the output wasn't intelligently generated but looks as if a color filter was applied.


Left: Parisians at a sidewalk cafe 1963, slightly random coloring
Right: Construction of the Eiffel Tower 1888, a clear success



Left: New York Commuters read of Kennedy's death 1963, filter
Right: American artillery destroys Nazi sub 1943, identified shapes but random coloring


Left: German prisoners of war marching 1944, clear success 
Right: Jimi Hendrix 1968, moderate success

Left: Assassination scene of Russian revolutionary Leon Trotsky, Mexico 1940, filter
Right: People enjoying an afternoon on the Seine river, Paris 1963, success

Left: American soldier in the Korean war, 1950, One of the rare failures for nature images
Right: Tsar Nicholas II and the Tsaritsa leave church, 1900, success

Left: Nazi war criminal awaits trial in Israel 1961, filter
Right: Hiroshima 1945, success


Left: Capt Francis Fenton in Korea 1950, filter
Right: American warriors face German forces across Berlin wall 1961, failure


Left: Soviet sniper Lyudmila Pavlichenko, 1941, success
Right: 'The Red Baron' petting his dog on the airfield 1916, failure


Left: Anti-war protester confronts police during the Cuban missile crisis, London 1962, no effect
Right: RMS Titanic ready for launch 1911, failure


Left: Titanic propellers before launch 1911, no effect
Right: Royal Navy sailor repairing signal flag on the way to Sierra Leone 1942, filter

As shown above, some pictures are deceivingly real, others look like some filter was applied, and in the worst case there was no effect. Despite the contrasts, it is clear the model got the grass green and the sky blue, probably a result of the training dataset (places365). Which brings us to the questions: how do we interpret the network's results, and how do we improve them? Train with more images? Tweak network parameters? We'll find out in the next blogpost. Thanks for reading!

Thursday, September 21, 2017

A gentle introduction to CNN.. [Part 1]

No, not the news channel.
Let's talk about Convolutional Neural Networks (CNNs), also known as ConvNets. You probably should care because they power a considerable portion of your apps, whether you realize it or not. Facebook uses them to auto-tag your photos; Google uses them not just for image tagging and street sign translation but also to create weird dreams. Pinterest, Instagram, Amazon, you name it, they all employ these networks in one way or another. We know CNNs are state of the art for computer vision because they have produced winners in the famous ImageNet challenge, among other challenges.

facebook auto-tag
Google 'weird' deep dream
Google Image translate




So, what exactly are CNNs?
To start with, they are neural networks (NNs). Modelled after our very own neuronal structure, neural networks are inter-connected processing elements (neurons) which process information by responding to external inputs.

Biological neuron
Artificial Neuron



You can imagine a single neuron as a black box that takes in numerical inputs and produces an output via a linear function followed by a non-linear activation function. When many neurons are stacked in a column they form a layer, and interconnected layers form a neural network.
Shallow Neural Network of Fully Connected layers



Deep Neural Network



For detailed explanation of NNs, see this post.
As shown above, all neurons in each layer are connected to the neurons in adjacent layers, making them Fully Connected (FC) layers.
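In code, a single neuron (and by extension a fully connected layer) is just a linear combination followed by a non-linear activation; a NumPy sketch using a sigmoid activation:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    # One neuron: weighted sum of the inputs, then a non-linear activation.
    return sigmoid(np.dot(w, x) + b)

def fc_layer(x, W, b):
    # A layer is a stack of neurons sharing the same input: one row of W per neuron.
    return sigmoid(W @ x + b)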

CNNs go a step further: instead of generic neurons, they are modelled after our very own visual cortex. They are composed not only of multiple layers but also of different kinds of layers: Input (IN), Convolutional (CONV), Non-Linear (RELU), Pooling (POOL), Normalization (optional), Fully Connected (FC) and Output (OUT) layers.

Of interest is the Convolutional layer, which performs the linear function, convolution. A convolution neuron is basically a filter that sweeps across the width and height of the input, computing the dot product between itself and the input elements in its receptive field, producing a 2D activation map. The dot product is computed along all three dimensions of the input: width, height and depth. For raw images the depth is 3, for the RGB pixel intensities. When multiple filters are stacked in a layer, each filter produces a different 2D activation map, rendering a 3D output. The CONV layer therefore maps a 3D input to a 3D output, which may differ in dimensions.

3D  Convolution
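A naive NumPy sketch of that sliding dot product (no padding, stride 1; real libraries are vastly more efficient) may make the dimension bookkeeping clearer:

import numpy as np

def conv_layer(image, filters):
    # image: (H, W, D); filters: (K, F, F, D). Each filter sweeps the image and
    # produces one 2D activation map; stacking the K maps gives a 3D output.
    H, W, D = image.shape
    K, F, _, _ = filters.shape
    out = np.zeros((H - F + 1, W - F + 1, K))
    for k in range(K):
        for i in range(H - F + 1):
            for j in range(W - F + 1):
                patch = image[i:i + F, j:j + F, :]          # receptive field
                out[i, j, k] = np.sum(patch * filters[k])   # dot product over all 3 dims
    return out

# e.g. a 32x32 RGB image with 8 filters of size 3x3x3 gives a 30x30x8 output:
# conv_layer(np.random.rand(32, 32, 3), np.random.rand(8, 3, 3, 3)).shape == (30, 30, 8)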


The Non-linear layer applies a non-linear function such as tanh or sigmoid, but most commonly ReLU (Rectified Linear Unit). It affects the values but leaves the dimensions unchanged.

The Pooling layer reduces the spatial size of its input by selecting one value to represent a relatively small region of the input. Common pooling operations include the maximum, the average and the l2-norm. This layer reduces the height and width of the input but does not affect the depth.
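For example, a 2x2 max-pooling pass halves the height and width while leaving the depth untouched (NumPy sketch, assuming non-overlapping windows):

import numpy as np

def max_pool(x, size=2):
    # x: (H, W, D) with H and W divisible by `size`; returns (H//size, W//size, D).
    H, W, D = x.shape
    out = np.zeros((H // size, W // size, D))
    for i in range(0, H, size):
        for j in range(0, W, size):
            out[i // size, j // size, :] = x[i:i + size, j:j + size, :].max(axis=(0, 1))
    return out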
A series of CONV-RELU-POOL layers is normally stacked together to form a bi-pyramid-like structure where the height and width decrease while the depth increases, as the number of filters grows in higher layers.

bi-pyramid structure

 Finally, the network is concluded with FC layers similar to traditional neural networks.

full-stack


Despite their complexity, CNNs have a competitive advantage over traditional neural networks. This comes as a result of their peculiar features, some of which I have outlined below.


  1. Convolution filters can be thought of as FC neurons that share parameters across all filter-sized portions of the input. This is in contrast to FC neurons, where each input feature would require a separate parameter. This leads to a smaller memory requirement for parameter storage. Fewer parameters also reduce the chances of overfitting to the training data.
  2. CNNs pick up patterns that result from the order of input features. In FC layers this order doesn't matter, since all features are independent and their locations are ignored. CONV filters, on the other hand, have a local receptive field, so patterns that arise out of proximity, or lack thereof, are easily detected. For instance, CNNs can easily detect edge patterns, or that eyes are close to the nose in faces.
  3. Spatial subsampling (pooling) ensures insensitivity to size/position/slant variations. This is important because images are unstructured data, meaning each pixel doesn't exactly represent a defined feature: a nose pixel in one selfie is most likely in another location in another selfie. A good model stays invariant to these distortions. In CNNs this is achieved by the pooling layer, which places importance on the relative rather than the absolute position of features in an image.
  4. Sparse and locally connected neurons eliminate the need for feature engineering. During training, the network determines which features are important by allocating appropriate parameters to different locations. Zero-valued parameters imply that the feature is not important.


CNNs are useful not just for images but also for time-series and speech data. They are a hot topic not only in academia but in industry as well, and this post has barely scratched the surface. Therefore, as the title suggests, there will be a part 2 follow-up, where we walk through the details of training CNNs from scratch. Thanks for reading!!!



Monday, September 4, 2017

On packaging and shipping tribes.


It all started with Facebook, the one social media platform I have a love-hate relationship with. I like that I find interesting news and opportunities there, but it always comes at a cost. Cat pictures are fun to watch but not very productive, and the carefully curated best versions of my friends online are sometimes sad to watch. I recently realized that most of the interesting things are shared in groups rather than on personal profiles. I figured I needed to see headlines from the groups I care about most, to decide whether it's worth logging in or not, so I made a tool for it: enter Tribe. With Tribe, all you do is enter the group id and optionally a start date, end date and data format, and out you get the posts in a properly formatted file.
pip install fb-tribe


json magic
csv data
Isn't the Facebook web interface much nicer than a csv file, you ask? Yes, and I've considered a friendlier interface; in the meantime it will remain as minimalist as it stands. Of what use is data in a structured file? Thanks for asking; ask a data science friend the same question and watch their eyes glow. So basically you built a mini-Facebook for data scientists and minimalist-disguised hippies? Exactly! I built it in Python and, as useless as it may sound, it had its challenges.

First off was the scope. Ideally I wanted to be able to scrape all groups, but the Facebook Graph API only allows CLI apps to scrape open groups. Closed and secret groups require a user token, which can only be obtained with the more secure Web and Mobile apps. Bummer! I settled on an MVP for public groups only.

Having not used Python for a while, I tackled interesting language problems, from the mundane problem of importing modules in subpackages to dealing with emojis and Chinese characters in the data. The topic of encoding probably deserves another blogpost, but the rule of thumb is to always explicitly write to files with 'utf-8' encoding. The Python default is charmap encoding, which doesn't speak emoji very well.
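In practice that boils down to one habit (a minimal example; the file name is made up):

# Name the encoding explicitly; the platform default (charmap on Windows)
# chokes on emoji and CJK characters.
with open("posts.json", "w", encoding="utf-8") as f:
    f.write("Tribe 🎉 你好")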

Equally fun was packaging and uploading my first package to PyPI, because you know, who doesn't like to just 'pip install [package]'? Trial and error and Jamie's blogpost helped me get through most of the issues of package directory structure and defining entry points and console scripts in setup.py. All was well until I couldn't upload the zip file due to some permission issue. After lots of tea and internet consumption, it dawned on me that another package called tribe already exists, hence the name change to fb-tribe. The package was uploaded, and all was well.
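For anyone fighting the same battles, the relevant part of a setup.py looks roughly like this (the module path and version below are illustrative, not copied from fb-tribe):

from setuptools import setup, find_packages

setup(
    name="fb-tribe",          # 'tribe' was already taken on PyPI
    version="0.1.0",
    packages=find_packages(),
    entry_points={
        # exposes a console command that calls main() in tribe/cli.py (illustrative path)
        "console_scripts": ["fb-tribe = tribe.cli:main"],
    },
)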

As you can tell, I had so much fun experimenting with the Facebook Graph API, Python and PyPI. If logging on to Facebook gives you anxiety, let's be friends: go ahead and 'pip install fb-tribe' from your command line. If you really love it, kindly give it a star (I like stars). If you find a bug or want to see a new feature, feel free to create an issue, and if you would like to contribute, feel free to open a pull request. Yes, the source code is public on GitHub, here.

So what's next?
Glad you asked. Perhaps some semantic analysis with NLP before presenting a post. Perhaps presenting it in a proper web interface, or an email service, or both; time will tell. In the meantime, I will be getting my Facebook updates the good old-fashioned way, like an offline magazine. Thanks for reading and have a great week ahead!


Tuesday, August 22, 2017

Course review: Neural Networks and Deep Learning

If you have been in the Machine Learning space, you know of the visionary and my favorite machine learning scientist, Andrew Ng. Not a lot of people can claim the titles of Stanford professor, Coursera cofounder, founder of the Google Brain project and chief scientist at Baidu, so I have all the reasons to look up to him. His latest venture is DeepLearning.ai, a startup for Deep Learning training which has launched a Deep Learning Specialization on Coursera. I'm not exactly a beginner in the field; in addition to doing the Machine Learning course on Coursera, I did other ML courses at school and did my thesis on an ML-related project. I thought it would be a good idea to do the specialization as a refresher, learn a different approach and advance on to deeper networks. I might have had other motivations, but none of them was as motivating as this tweet.



I just finished the first course, Neural Networks and Deep Learning, and I'd like to share a few nuggets.
The specialization is intended for beginners who are comfortable with Python and linear algebra and intend to create neural network classifiers in 4 weeks. It builds up to neural networks from the simple logistic regression classifier. Andrew slowly goes through the concepts of matrix operations, gradient descent and backpropagation, from a single neuron to a deep neural network. This pace allows one to develop intuition for the concepts without feeling overwhelmed. There are also programming assignments for the final 3 weeks, leading up to the final task of building a cat image classifier.


I've completed it with a full mark, thanks to the following features that made it a whole lot more enjoyable:
  • It guides you to develop intuitions about neural networks. Although it may sound annoying, Andrew keeps repeating the same equations and concepts in a lot of the videos, and it really does help them stick. Some videos are even dedicated to debugging numpy code and matrix equations.

  • The programming assignments require no setup, letting you focus on the concepts. They are done in Jupyter Notebooks which are hosted on the Coursera hub. This is a feature I'm sure complete beginners will appreciate, since the last thing you want is installation errors.
  • My very favourite feature was Heroes of Deep Learning, where he interviews the heroes of deep learning: Ian Goodfellow, the inventor of Generative Adversarial Networks; Geoff Hinton, one of the inventors of backpropagation, who also introduced backprop for word embeddings, Boltzmann machines and deep belief nets, among others; and finally Pieter Abbeel, who has developed algorithms that enable helicopters to do advanced aerobatics. Reading about these heroes is a lot different from actually hearing them talk. I mean, I never would have guessed that Geoff Hinton struggled to get a job and even tried carpentry while figuring out his research interests. They all gave very good advice on how to get into Machine Learning, and I was both surprised and delighted that they all emphasized project-based learning instead of swallowing all the papers written to date (which is obviously impossible).
There were times when it felt very repetitive, but it's a tradeoff I'm willing to take and it's also good in the long run. I genuinely enjoyed the course, so here's to finishing the remaining 4 courses in the specialization and doing rad deep things! Enjoy the rest of your week!



Sunday, July 9, 2017

What's in your neurons?

Thanks to Tony Stark and Elon Musk, everyone seems to be talking about AI, Big Data and Machine Learning, even though very few actually understand them thoroughly. To most people, AI is a magical tool fuelling self-driving cars, robots and the ever-ubiquitous recommender systems, among other magical things. True indeed, most Machine Learning (ML) models are black boxes, trained with copious quantities of data. They are fed with data from which they learn, but little is known about how they learn and what each unit will learn.

Definitely magic
But it's not...



Backed by calculus and linear algebra, Machine Learning is far from magic. Having learnt how the models learn, I was curious to interpret what the models actually learn.

Support Vector Machine (SVM)
This is no doubt one of the simplest models used for classification. Given input data with dimensionality D and C different output classes, it uses a score function f(x_i) = W x_i + b to compute scores for the different classes and assigns the input to the class with the highest score,
where W is a C*D weight matrix, x_i is a D*1 input sample vector, and b is a C*1 bias vector. The result is a C*1 score vector with a score for each of the C classes; the output is the class with the highest score. This operation can be visually represented as follows:


Each row i of W can be interpreted as the ideal vector for the corresponding class, and the product Wx_i as a set of dot products that measure the similarity between each class and the input. The closer they match, the higher the score, and vice versa. For image classification, demonstrating this is as easy as plotting each row of W of a trained model as a D-dimensional image.


Taken from the Stanford course cs231n website, these images reveal what the model thinks each class should look like. The uncanny similarity between the deer and the bird speaks volumes about the limited capabilities of the SVM. Although SVMs are easily trained and interpreted, they don't exactly produce the best results, hence the need for more sophisticated models: Neural Networks.
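The whole classifier fits in a few lines of NumPy, which is also why its rows are so easy to visualize (toy, randomly initialized weights for illustration):

import numpy as np

D, C = 32 * 32 * 3, 10              # e.g. CIFAR-10-sized inputs, 10 classes
W = np.random.randn(C, D) * 0.01    # one "template" row per class
b = np.zeros(C)

def predict(x):
    scores = W @ x + b              # C scores, one per class
    return np.argmax(scores)        # the class with the highest score wins

# After training, each row W[c] can be reshaped to (32, 32, 3) and plotted as that class's template.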

Neural Networks.
Neural Networks (NNs) are made of SVM-like units called neurons, arranged in one or more layers. Unlike SVMs, neurons have one output, computed with a sigmoid-like activation function which indicates whether a neuron fires or not. It is due to this complexity that Neural Networks are universal approximators, capable of approximating any function, of course at the expense of transparency.

Folks at Google accepted the challenge and published interesting results in their famous inceptionism. By turning the network upside down, they were able to show what the network learnt about each class. For instance, in the image below, this is what the network learnt about bananas from random noise. 

In addition, they also deduced that the higher up a layer is, the more abstract the features it detects. This was done by "asking" a network layer to enhance what it detected in the images. Lower layers detected low-level features and patterns like edges and strokes.
Patterns seen by low layers

Higher layers produced more interesting images, things you'd see in your dreams.  




Even more interesting are Recurrent Neural Networks (RNNs), neural networks with memory that are used for predicting sequences. In his famous blogpost, Andrej Karpathy (a Deep Learning researcher who should know) wrote not only about how effective they are but also about how they can be interpreted. Using the LSTM variant, he was able to generate interesting texts, from Paul Graham's essays, Shakespeare's poems and research papers all the way to Linux source code. In addition to that, he was able to single out what some neurons detected. Only 5% of the neurons detected clearly defined patterns, but these patterns are interesting to say the least.
This one fired at the ends of lines.

This one fired on text between quotes.


NB: Red indicates higher activation compared to blue.
I hope this is enough to convince you that AI isn't black-box magic that computers will use to exterminate mankind, but a very scientific process. It may not be clear in advance what each neuron will learn, but we can control what it learns using hyperparameters and regularization, and, after training, figure out what each neuron has learnt.

No doubt the original neural network, the brain, learns in a similar way. Just as connection weights are strengthened through backpropagation, neurons reinforce connections that are repeatedly used. Neuroscientists refute this claim, but what is true in both cases is that learning is highly influenced by input data and hyperparameters. In NNs these hyperparameters include the learning rate and regularization constants, while in the brain they are influenced by emotions, curiosity and the frequency of learning. We can't control all of them, but we can definitely control the kind of data we feed our brains, and how often we do so. On that note, what stuff are you feeding your brain? What fires your neurons? Have a great week ahead!

Sunday, July 2, 2017

Sorting pancakes with Bill Gates


I've seen this meme all over the internet. We know how comforting it is when life is simply not working out. It feeds the notion that you can struggle as a student and still be useful to the world. The only problem is that the reality is the complete opposite. Bill Gates was far from a sloppy student; he was the type of engaged student who read and thought a lot. In fact, as an undergrad, he published an academic paper that for 30 years remained the state-of-the-art solution to a very fundamental problem. Although the title 'Bounds for Sorting by Prefix Reversal' [1] sounds boring and academic, it's about pancakes. So grab some pancakes; let's talk about pancakes.

What exactly is the problem with pancakes? None; such a sweet and comforting food source is far from a nuisance. The problem is with the chef who, like me, likes to throw things around with little attention to aesthetics. Pancakes come in different sizes, and most people agree they look better when sorted with the biggest at the bottom. Given an arbitrary order produced by our chef, produce a pleasant-looking stack with the minimum number of flips. As easy as it sounds, it is still an open Computer Science problem, since the only move allowed is reversing a sequence of pancakes from the top.

When Prof. Harry Lewis casually posed this question to his undergrad class 40 years ago, he wasn't expecting any breakthrough. That was until one William Gates brought him a solution 2 days later. Previously, it was known that the solution is at most 2(n-1) flips, because the lazy approach is to start by flipping the biggest pancake to the bottom (at most 2 flips) and to do so iteratively for the next biggest until everything is sorted. Like all brute-force algorithms, it works but doesn't scale well; imagine a very hungry customer ordering 1000 pancakes. He'll definitely starve. Good guy Bill proposed an algorithm with a substantially better upper bound.
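The lazy 2(n-1) approach is easy enough to write down (a quick Python sketch; list index 0 is the top of the stack):

def flip(stack, k):
    # Reverse the top k pancakes.
    return stack[:k][::-1] + stack[k:]

def naive_pancake_sort(stack):
    # At most 2 flips per pancake: bring the biggest unsorted one to the top,
    # then flip it down into place -- roughly 2(n-1) flips in the worst case.
    flips = 0
    for size in range(len(stack), 1, -1):
        biggest = stack.index(max(stack[:size]))
        if biggest != size - 1:
            if biggest != 0:
                stack = flip(stack, biggest + 1)   # bring it to the top
                flips += 1
            stack = flip(stack, size)              # send it to the bottom of the unsorted part
            flips += 1
    return stack, flips

print(naive_pancake_sort([3, 1, 4, 2]))   # ([1, 2, 3, 4], 4) -- smallest on top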

Pancakes are represented as a permutation of numbers reflecting their sizes. If two neighbors in the permutation differ by 1, the pair is called an adjacency. A block is a series of consecutive adjacencies, and an element that is not in a block is a free element. The perfectly sorted permutation 123...n has n-1 adjacencies and one block; others have fewer adjacencies and more blocks. The algorithm Bill proposed increases adjacencies by designing minimum flip sequences for all possible configurations of the first element of the permutation with respect to its neighbors. For each configuration, different kinds of flips are performed, increasing the number of adjacencies by at least one and changing the number of blocks in different ways. The result is a linear programming problem with the objective of maximizing the number of flips while constrained by the number of adjacencies and blocks. Using the duality theorem of Von Neumann, Kuhn, Tucker and Gale, the upper bound on the number of flips is established to be of the order of five thirds (actually (5n+5)/3) of the number of elements in the permutation. As if a 1.2 times speedup were not enough, the paper goes on to establish a lower bound as well as bounds for the restricted case where pancakes not only have to be sorted but also sorted right side up.

Sorting pancakes may not sound like a groundbreaking problem, but it is the same problem encountered when establishing gene similarities between species. For 30 years, Gates' algorithm remained state of the art, until folks at the University of Texas produced a marginal improvement [2], utilizing automation made possible by the same Bill Gates.

I have always known that it takes a lot more than quitting Harvard to revolutionize the tech industry, or at the very least launch a successful startup. Now I know that uncertainty fused with bountiful curiosity is a major ingredient of the recipe. Bill Gates didn't foresee anything; he simply solved a problem that interested him, building his confidence to solve bigger problems. If you are confused about your purpose in life, hang in there, savor the feeling, it's building up to something. In the meantime, get even more inspired with Gates Notes and have a great weekend ahead.

References:
[1] Bounds for Sorting by Prefix Reversal, by W. Gates and C. H. Papadimitriou
[2] An 18n/11 upper bound for sorting by prefix reversals, by B. Chitturi, W. Fahle, Z. Meng, L. Morales, C. O. Shields, I. H. Sudborough and W. Voit