class: center, middle

### W4995 Applied Machine Learning

# Word Embeddings

04/11/18

Andreas C. Müller

???
Today we'll talk about word embeddings. Word embeddings are the logical next step in text analysis after working with topic models on Monday.

FIXME: redo paragraph vector experiments
FIXME: state of the art?
FIXME: too short?
FIXME: word embeddings with tensorflow?
FIXME for CBOW and skip-gram shared weights for context
FIXME for CBOW, which do we keep?

---
# Beyond Bags of Words

Limitations of bag of words:
- Semantics of words not captured
- Synonymous words not represented
- Very distributed representation of documents

???
Last time we started talking about limitations of bag of words: the semantics of words are not captured, and synonymous words are not represented. If I use two synonymous words, for example "film" and "movie", they are represented in completely different ways and there is no connection between them. We also end up with a very distributed representation of documents, which might be hard to understand or hard to learn from.

---
# Last Time

- Latent Semantic Analysis
- Non-negative Matrix Factorization
- Latent Dirichlet Allocation
- All embed documents into a continuous, corpus-specific space.
- Today: Embed words in a “general” space.

???
Last time we looked at latent semantic analysis, non-negative matrix factorization and latent Dirichlet allocation. Each of them embeds the documents into a continuous, corpus-specific space of a dimension of our choosing, using an unsupervised objective. The first two are matrix factorizations, the third is a probabilistic generative model. Today we'll be talking about a more general embedding, which tries to embed individual words, not documents. The goal is to learn an embedding that's generally useful, and not necessarily tied to a particular corpus.

---
# Idea

- Unsupervised extraction of semantics using large corpus (Wikipedia etc.)
- Input: one-hot representation of word (as in BoW).
- Use auxiliary task to learn continuous representation.

???
We'll talk about several methods, but they all use a very large corpus to create a relatively generic embedding, something like Wikipedia, Google News, or a big corpus of books. It's an unsupervised problem, but it is phrased as an auxiliary supervised task to learn the embedding; what I mean by that will become clear shortly. The starting point is a one-hot representation of each word, as in bag of words, which basically just means each word is identified by some index.

---
# Skip-Gram models

- Given a word, predict surrounding words
- Supervised task, each document yields many examples
- Not interested in performance for this task, just want to learn representations.

???
A very commonly used auxiliary task is the skip-gram model. Here the task is: given a word, predict the surrounding words. This is a supervised task, with each document providing many examples, basically as many as there are words in the document. I'm calling this an auxiliary task because we're not really interested in solving the problem of predicting surrounding words; we just want to use this task to learn useful representations.

---
class: left
# Example

```
[“What is my purpose?”, “You pass the butter.”]
```

.center[
![:scale 50%](images/context_window.png)
]

Using context windows of size 1 (in practice 5 or 10):

???
Here's an example of what this task looks like. Let's say we have a corpus of two documents, "what is my purpose" and "you pass the butter".
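As an illustration (an addition, not from the original slides), here is a minimal sketch of how the (center word, context word) training pairs could be extracted from this toy corpus with a context window of size 1; the tokenization is deliberately naive:

```python
# Minimal sketch: extract (center word, context word) training pairs
# from the toy corpus above, using a context window of size 1.
# Words at the edges only get partial context.
docs = ["What is my purpose?", "You pass the butter."]
window = 1

pairs = []
for doc in docs:
    tokens = doc.lower().replace("?", "").replace(".", "").split()
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))

print(pairs[:4])
# [('what', 'is'), ('is', 'what'), ('is', 'my'), ('my', 'is')]
```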
We specify a context window size, let's say one. Then, for each word that has enough context, we generate a training example. So for "is", we have a context of "what" and "my"; for "my" we have "is" and "purpose", and so on. The input would be "is", and the goal is to train a model that assigns a high probability to the words before and after it, "what" and "my". In practice larger context windows are used, often 5-10 words in either direction. Clearly it's impossible to actually solve this task, because each word can appear in many different contexts. The goal is to learn which contexts are likely, and which are not.

---
.center[
![:scale 50%](images/cbow_skipgram_1.png)
]

???
Here's the architecture that we will be using for this. It's kind of like a neural network, only it is completely linear. On the left-hand side, we have the input layer with a one-hot encoding, which is V-dimensional, where V is the size of the vocabulary; only one entry is 1 and all the others are 0. Then we have a weight matrix connecting it to the n-dimensional hidden layer. From there, we have multiple outputs that correspond to the previous word, the word two positions back, three positions back and so on, and likewise the next word, the word two positions ahead, and so on. Each output is again V-dimensional, predicting which word of the vocabulary appears at that position, so these output layers are very large. On top of the hidden layer this is basically logistic regression: a linear classifier with a softmax activation, and each of these output matrices tries to predict the word at a particular offset before or after the input. So you take the one-hot encoding, multiply it with the first matrix, multiply the result with another matrix, and then apply a softmax. The reason you have two matrix multiplications is to create a bottleneck: a single very wide matrix could have full rank, but you want a low-rank factorization that compresses the representation into something small. Each row of W corresponds to a single word. We train the whole thing in a supervised fashion to minimize the log loss, but all we're really interested in is the weight matrix W.

---
# Softmax Training

.center[
![:scale 70%](images/softmax.png)
]

.smallest[
Mikolov et al. - Distributed Representations of Words and Phrases and their Compositionality (2013)
]

???
The problem is that you sum over all the outputs, and the outputs are all the words in the vocabulary. If you have a million words in your vocabulary, this is a lot of work, and it's a step you need to do in every optimization step. So what most algorithms do is approximate this sum by sampling some words: they normalize over a sample of words rather than the whole vocabulary. This skip-gram model is the one that's most commonly used.

---
# CBOW vs Skip-gram

.center[
![:scale 70%](images/cbow_skipgram.png)
]

.smallest[
Efficient Estimation of Word Representations in Vector Space https://arxiv.org/pdf/1301.3781.pdf
]

???
This is the skip-gram model, shown here on the right. We start with the one-hot encoding of the word, and we have a matrix multiplication into a latent space. A matrix multiplication with a one-hot vector just means that each row in that embedding matrix represents one word. From this latent space, we run logistic regression, trying to predict what the word is at each particular offset; in this picture we're trying to predict the word two to the left, one to the left, one to the right, and two to the right. So these are basically four classifiers running on the same input.

Another approach is shown on the left, the continuous bag of words model, or CBOW. This is something that seems a bit more natural to me: given the context, we are trying to predict the word in the center. So here we would be given four context words and try to figure out the most likely word in between them. In practice, skip-grams tend to work better. CBOW might capture things like synonymy, because I can swap one synonym for another and the sentence will still make sense in the same context: whether I say "I watched a movie" or "I watched the film" doesn't really matter.

---
# Implementations

- Gensim
- Word2vec
- Tensorflow
- Don't train yourself

???
Word2vec provides both the skip-gram and the continuous bag of words model. You can also train these models with TensorFlow; if you actually want to train one yourself, that's probably the best option, because you can do it on a GPU. These are large matrices trained on large corpora, which would take forever on a CPU; on a GPU it becomes much more bearable. But generally, just don't train it yourself. You can download several pretrained word2vec models, for example from Google.

---
# Gensim - topic models for humans

- Multiple Latent Dirichlet Allocation implementations
- Wrappers for Mallet and vowpal wabbit
- Tools for analyzing topic models
- No supervised learning
- Uses list-of-tuples instead of sparse matrices to store documents.

???
Gensim is a Python library for text analysis that is very easy to use and has a lot of features. Last time I mentioned spacy and NLTK as libraries for NLP; Gensim is the third one in that list, and it is very specific to topic models. It has multiple implementations of LDA, and it has wrappers for Mallet and vowpal wabbit, which also have very fast LDA implementations. It has no supervised learning; it's all about document retrieval and feature extraction. The main thing to keep in mind when you combine it with scikit-learn is that it uses a list of tuples to store documents instead of sparse matrices. It's completely equivalent, just a different data structure, but if you want to mix it with scikit-learn, you need to make sure to convert it.
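Even though the advice above is "don't train it yourself", here is a minimal sketch (an addition, not from the slides) of what training a small model with gensim could look like; it assumes gensim 4.x, where the dimensionality parameter is called `vector_size` (older versions use `size`), and `sg=1` selects skip-gram while `sg=0` selects CBOW:

```python
# Minimal sketch: training a tiny skip-gram model with gensim on a toy
# corpus. A useful model needs a very large corpus and many epochs.
from gensim.models import Word2Vec

sentences = [["what", "is", "my", "purpose"],
             ["you", "pass", "the", "butter"]]

model = Word2Vec(sentences=sentences,
                 vector_size=50,  # dimensionality of the embedding
                 window=5,        # context window size
                 sg=1,            # 1 = skip-gram, 0 = CBOW
                 min_count=1)     # keep even rare words in this toy example

print(model.wv["butter"].shape)   # (50,)
```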
---
# Introduction to gensim

.smallest[
```python
docs = ["What is my purpose", "You bring butter"]
texts = [[token for token in doc.lower().split()] for doc in docs]
print(texts)
```
```
[['what', 'is', 'my', 'purpose'], ['you', 'bring', 'butter']]
```
```python
from gensim import corpora
dictionary = corpora.Dictionary(texts)
print(dictionary)
```
```
Dictionary(7 unique tokens: ['butter', 'you', 'is', 'purpose', 'my']...)
```
```python
new_doc = "what butter"
dictionary.doc2bow(new_doc.lower().split())
```
```
[(3, 1), (5, 1)]
```
```python
corpus = [dictionary.doc2bow(text) for text in texts]
corpus
```
```
[[(0, 1), (1, 1), (2, 1), (3, 1)], [(4, 1), (5, 1), (6, 1)]]
```
]

???
Use NLTK or spacy or a regexp for tokenization in real applications; this method here doesn't even handle punctuation. In gensim, a document is represented as a list of (index, frequency) tuples, and a corpus is a list of such documents.

---
# Converting to/from sparse matrix

.smallest[
```python
import gensim
corpus
```
```
[[(0, 1), (1, 1), (2, 1), (3, 1)], [(4, 1), (5, 1), (6, 1)]]
```
```python
gensim.matutils.corpus2csc(corpus)
```
```
<7x2 sparse matrix of type '<class 'numpy.float64'>'
	with 7 stored elements in Compressed Sparse Column format>
```
```python
X = CountVectorizer().fit_transform(docs)
X
```
```python
sparse_corpus = gensim.matutils.Sparse2Corpus(X.T)
print(sparse_corpus)
print(list(sparse_corpus))
```
```
[[(4, 1), (3, 1), (2, 1), (5, 1)], [(1, 1), (0, 1), (6, 1)]]
```
]

???
Most transformations in gensim are lazy: they yield a generator (like Sparse2Corpus) which can be converted to a list. Note that the CSC matrix is transposed compared to what we would expect in sklearn.

---
# Corpus Transformations

.smaller[
```python
tfidf = gensim.models.TfidfModel(corpus)
tfidf[corpus[0]]
```
```
[(0, 0.5), (1, 0.5), (2, 0.5), (3, 0.5)]
```
```python
print(tfidf[corpus])
print(list(tfidf[corpus]))
```
```
[[(0, 0.5), (1, 0.5), (2, 0.5), (3, 0.5)], [(4, 0.577), (5, 0.577), (6, 0.577)]]
```
]

???
You can also build a TFIDF model with gensim.

---
# Word2Vec in Gensim

```python
from gensim import models
w = models.KeyedVectors.load_word2vec_format(
    '../GoogleNews-vectors-negative300.bin', binary=True)
w['queen'].shape
```
```
(300,)
```
```python
w.vectors.shape
```
```
(3000000, 300)
```

???
There are two benefits to using these prebuilt models: training them takes a long time, and you probably don't have the corpus anyway. You can download Wikipedia, but I don't think you can actually download the Google News corpus. There are 3,000,000 word vectors stored in this model, so the vocabulary has size 3,000,000, and each word in the vocabulary has a 300-dimensional vector. I can only apply this to words that are within this vocabulary; for any words that are not, the model is useless.

The first thing we want to do is check whether these representations are actually semantic in some way. To do this, we first need to define how to measure similarity in this space. The way similarity between words is usually measured is the cosine similarity.

---
# Cosine Similarity

.padding-top[
`$$\Large \text{similarity}(v, w)=\cos(\theta) = \frac{v^Tw}{\lVert v\rVert \lVert w \rVert}$$`
]

???
This is quite related to the Euclidean distance, but it normalizes away the lengths of the vectors. It is a similarity measure that's very commonly used in NLP, for example to compare whether two documents are similar. For documents in the bag-of-words representation this is quite intuitive, since it just normalizes the length of the document away.

---
class: smallerr
# Inspecting Semantics

```python
w.most_similar(positive=['movie'], topn=5)
```
```
[('film', 0.8676770925521851),
 ('movies', 0.8013108968734741),
 ('films', 0.7363011837005615),
 ('moive', 0.6830361485481262),
 ('Movie', 0.6693680286407471)]
```

???
Now that we know how to measure similarity in this space, we can find similar words. Let's start with 'movie' and its top 5 matches, ordered from most to least similar. This computes the cosine similarity between the 300-dimensional vector for 'movie' and the 300-dimensional vectors of the 3 million other words, and returns the 5 closest ones. In this case they are 'film', 'movies', 'films', 'moive' and 'Movie'. I guess 'moive' is a common misspelling.

---
```python
w.most_similar(positive=['good'], topn=5)
```

???
Trying the same for the word 'good' gives 'great', 'bad', 'terrific', 'decent' and 'nice'. 'bad' is an antonym, not a synonym, of 'good'; this makes sense, since a sentence often still makes sense if we replace 'good' with 'bad'.

---
```
[('great', 0.7291510105133057),
 ('bad', 0.7190051078796387),
 ('terrific', 0.6889115571975708),
 ('decent', 0.6837348341941833),
 ('nice', 0.6836092472076416)]
```
--

```python
w.most_similar(positive=['cute', 'dog'], topn=5)
```
```
[('puppy', 0.7645270228385925),
 ('chihuahua', 0.7209305763244629),
 ('adorable_puppy', 0.7108923196792603),
 ('yorkie', 0.7013816833496094),
 ('Shitzu', 0.7004074454307556)]
```

???
You can also average multiple vectors and look at what's close to the average; here I've done that for 'cute' and 'dog'.
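As a sanity check on the cosine similarity formula above, here is a minimal sketch (an addition, not from the slides) that computes it directly with NumPy on the `w` vectors loaded earlier and compares it to gensim's built-in `similarity` method:

```python
# Minimal sketch: cosine similarity between two word vectors "by hand".
import numpy as np

v, u = w['movie'], w['film']
cos = np.dot(v, u) / (np.linalg.norm(v) * np.linalg.norm(u))
print(cos)                            # roughly 0.87, matching most_similar above
print(w.similarity('movie', 'film'))  # gensim's built-in cosine similarity
```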
---
# Prepare document (IMDB)

.smallest[
```python
print(text_train_sub[0].decode())
```
```
Maybe it's just because I have an intense fear of hospitals and medical stuff, but this one got under my skin (pardon the pun). This piece is brave, not afraid to go over the top and as satisfying as they come in terms of revenge movies. Not only did I find myself feeling lots of hatred for the screwer and lots of sympathy towards the "screwee", I felt myself cringe and feel pangs of disgust at certain junctures which is really a rare and delightful thing for a somewhat jaded horror viewer like myself. Some parts are very reminiscant of "Hellraiser", but come off as tribute rather than imitation. It's a heavy handed piece that does not offer the viewer much to consider, but I enjoy being assaulted by a film once and awhile. This piece brings it and doesn't appologize. I liked this one a lot. Do NOT watch whilst eating pudding.
```
```python
vect_w2v = CountVectorizer(vocabulary=w.index2word)
vect_w2v.fit(text_train_sub)
docs = vect_w2v.inverse_transform(vect_w2v.transform(text_train_sub))
docs[0]
```
```
array(['in', 'for', 'that', 'is', 'the', 'at', 'not', 'as', 'it', 'by',
       'are', 'have', 'an', 'this', 'they', 'but', 'one', 'which', 'do',
       'than', 'over', 'just', 'some', 'like', 'only', 'did', 'because',
       'off', 'being', 'my', 'very', 'much', 'go', 'under', 'does', 'got',
       'top', 'come', 'really', 'lot', 'find', 'thing', 'once', 'offer',
       'feel', 'film', 'medical', 'terms', 'rather', 'certain', 'felt',
       'consider', 'watch', 'parts', 'heavy', 'towards', 'enjoy',
       'feeling', 'maybe', 'piece', 'fear', 'myself', 'stuff', 'handed',
       'brings', 'movies', 'hospitals', 'rare', 'lots', 'skin', 'eating',
       'intense', 'somewhat', 'liked', 'afraid', 'tribute', 'horror',
       'revenge', 'brave', 'whilst', 'sympathy', 'assaulted',
       'satisfying', 'hatred', 'viewer', 'awhile', 'pardon', 'delightful',
       'disgust', 'imitation', 'pudding', 'cringe', 'jaded', 'pun',
       'pangs', 'doesn', 'junctures', 'hellraiser', 'appologize'],
      dtype='
```
]
---

Find closest vector to vec(“Queen”) + (vec(“Kings”) - vec(“King”))

.smallest[
Mikolov et al. - Linguistic Regularities in Continuous Space Word Representations (2013)
]

???
You can use these vectors to find relationships. People found that the difference between the vectors for "man" and "woman" looks very similar to the difference between "uncle" and "aunt", or "king" and "queen". They also found that, for example, the difference between "King" and "Kings" is about the same as between "Queen" and "Queens". So if you want to ask what is to "Queen" as "Kings" is to "King", you can take the vector for "Queen", add the vector for "Kings", and subtract the vector for "King"; the closest point to that turns out to be "Queens". People did this with a lot of different kinds of semantic pairs.

---
.center[
![:scale 80%](images/analogues.png)
]

.smallest[
Mikolov et al. - Distributed Representations of Words and Phrases and their Compositionality (2013)
]

???
One thing that's quite interesting: people embedded countries and their capitals in a 300-dimensional space and then ran PCA on it, projecting all the different countries and their capitals down to a 2-dimensional space. Even after this projection, the connecting vectors all look pretty parallel: the vector that goes from Poland to Warsaw is basically the same as the one that goes from Japan to Tokyo. You can use this to discover analogies or relations.

---
# Finding Relations
$$\large a : b :: c : ?$$
$$\large d = \text{arg}\max_i\frac{\left(\text{vec}(b) - \text{vec}(a) + \text{vec}(c)\right)^T \text{vec}_i}{\lVert\text{vec}(b) - \text{vec}(a) + \text{vec}(c)\rVert\,\lVert \text{vec}_i\rVert}$$

???
To answer "a is to b as c is to ?", we take vec(b) - vec(a) + vec(c) and return the word whose vector has the highest cosine similarity with it.

---
class: middle

.center[
![:scale 70%](images/table_cities_states.png)
]

.smallest[
Stanford CS 224D: Deep Learning for NLP
]

???
The model is able to figure out the relationship between a city and the state it is in. So now we have a space in which vector additions have semantic meaning. When people found out about this, they were really excited.

---
# Examples with Gensim

.smaller[
```python
w.most_similar(positive=['woman', 'king'], negative=['man'], topn=3)
```
```
[('queen', 0.7118192911148071),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902431607246399)]
```
```python
w.most_similar(positive=['woman', 'he'], negative=['man'], topn=3)
```
```
[('she', 0.8492251634597778),
 ('She', 0.6329933404922485),
 ('her', 0.6029669046401978)]
```
```python
w.most_similar(positive=['Germany', 'pizza'], negative=['Italy'], topn=3)
```
```
[('bratwurst', 0.5436394810676575),
 ('Domino_pizza', 0.5133179426193237),
 ('donuts', 0.5121968984603882)]
```
]

???
In the first example, I'm computing 'woman' + 'king' - 'man', and the top 3 results are 'queen', 'monarch' and 'princess'. 'queen' is the closest, since 'man' is to 'king' as 'woman' is to 'queen'. The relationships found here definitely depend a lot on the corpus that was used.
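To connect these queries back to the formula on the "Finding Relations" slide, here is a minimal sketch (an addition, not from the slides) that computes the same kind of analogy by hand on the stored vectors:

```python
# Minimal sketch: the arg max over cosine similarities from the
# "Finding Relations" slide, computed directly on w.vectors.
import numpy as np

query = w['king'] - w['man'] + w['woman']
sims = w.vectors @ query / (
    np.linalg.norm(w.vectors, axis=1) * np.linalg.norm(query))
best = np.argsort(sims)[::-1][:5]
print([w.index2word[i] for i in best])
# The input words ('king', 'woman') usually rank near the top;
# most_similar filters them out, leaving 'queen' as the best match.
```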
---
class: center

.center[
![:scale 70%](images/paper_title.png)
]
.smaller[
$\overrightarrow{man} - \overrightarrow{woman} \approx \overrightarrow{king} - \overrightarrow{queen}$
$\overrightarrow{man} - \overrightarrow{woman} \approx \overrightarrow{\text{computer programmer}} - \overrightarrow{homemaker}$
]

???
This relation is actually in there. If you ask "man is to computer programmer as woman is to x", you get "homemaker", and we might not be very happy about this.

---
# Going along the he-she direction:
.center[
![:scale 90%](images/he_she.png)
]

???
They did an analysis of this particular embedding and sorted he-she analogies into gender-appropriate ones and gender stereotypes. Some of them are clearly something that we would not want, and this is just one particular instance. As with all machine learning models, this learns all the biases of the data that you give it; if you give it a lot of text, it learns all the biases of human society from that text.

---
class: middle

# Paragraph Vectors

???
Alright, the next thing I want to talk about is a slightly smarter way of embedding documents: paragraph vectors. Instead of just taking averages of word vectors, we're trying to do a little bit better.

FIXME rerun experiments from doc2vec paper to show it works?

---
# Doc2Vec
.center[
![:scale 70%](images/doc2vec.png)
]

.smallest[
Le, Mikolov: Distributed Representations of Sentences and Documents (2014)
]

???
Based on CBOW, not skip-grams. Add a vector for each paragraph / document, also randomly initialized. To infer the vector for a new paragraph: keep the weights fixed and do stochastic gradient descent on the representation D, sampling context examples from this paragraph. FIXME better explanation. There is no gradient descent at encoding time for word2vec!

The most common approach here is doc2vec. On the right-hand side is the continuous bag of words model, where you get the context and want to predict the word from the context. What paragraph vectors now do is add another input, a paragraph-specific ID. It's basically like adding a unique word to the vocabulary that is specific to this paragraph and encodes membership in it. So you not only learn an embedding for all the words, you also learn an embedding for each paragraph, which tries to make up for what the words cannot predict: you still have the standard CBOW prediction, but it gets corrected by the paragraph-specific vector. This is maybe not the most intuitive construction, but it actually seems to work quite well.

In practice this is quite a bit trickier than just looking up word vectors, because what do you do when you get a new document? With word vectors, for each word you either find it in the vocabulary and take the vector you already computed, or it's not in the vocabulary and you're out of luck anyway. But if you get a new paragraph, how do you get its paragraph vector? The paragraph has never been seen before; it's not in the training set. What you do is sample training examples (contexts and words) from this paragraph, fix all the word vectors you learned on the training set, initialize the paragraph vector randomly, and then learn the paragraph vector by gradient descent. In this sense it tries to capture what the model could not figure out from the word vectors alone: a vector that represents everything in the paragraph that was not already encoded by the local context, or perhaps a wider context. It usually has the same dimensionality as the word vectors, so here we would get, say, a 300-dimensional vector for each paragraph. But to encode a new paragraph, you need to run gradient descent.

---
# Doc2Vec with gensim

.smaller[
```python
def read_corpus(text, tokens_only=False):
    for i, line in enumerate(text):
        if tokens_only:
            yield gensim.utils.simple_preprocess(line)
        else:
            # For training data, add tags
            yield gensim.models.doc2vec.TaggedDocument(
                gensim.utils.simple_preprocess(line), [i])

train_corpus = list(read_corpus(text_train_sub))
test_corpus = list(read_corpus(text_val, tokens_only=True))

model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2)
model.build_vocab(train_corpus)
model.train(train_corpus, total_examples=model.corpus_count, epochs=55)
```
]

.smallest[
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb
]

???
FIXME example of nearest neighbors to see if this learned something useful! Should use the unlabeled reviews!
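One way to do that nearest-neighbor sanity check is sketched below (an addition, not from the slides); it assumes the model trained above and gensim 4.x, where the trained document vectors live in `model.dv` (older versions use `model.docvecs`):

```python
# Minimal sketch: infer a vector for one training review and look up the
# most similar trained document vectors; ideally the review itself (tag 0)
# ranks at or near the top.
inferred = model.infer_vector(train_corpus[0].words)
print(model.dv.most_similar([inferred], topn=3))
```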
So here I started completely fresh, with no model at all, and learned embeddings for all of the words and all of the paragraphs. I then did a small validation to see if the word vectors that I learned are meaningful.

---
# Validation of Word Vectors

```python
model.wv.most_similar("movie")
```
```
[('film', 0.9482318758964539),
 ('flick', 0.822228193283081),
 ('series', 0.715380072593689),
 ('programme', 0.7032747268676758),
 ('sequel', 0.6939107179641724),
 ('story', 0.6771408319473267),
 ('show', 0.6559576392173767),
 ('documentary', 0.6537493467330933),
 ('picture', 0.6427854299545288),
 ('thriller', 0.6300673484802246)]
```

???
I looked at the words most similar to 'movie' and we get these results. I actually like them better than the ones from the pre-built model. They are pretty good, though maybe not that surprising, since 'movie' is about the most common word in the whole corpus.

---
# Encoding using doc2vec

.smaller[
```python
vectors = [model.infer_vector(train_corpus[doc_id].words)
           for doc_id in range(len(train_corpus))]
X_train = np.vstack(vectors)
X_train.shape
```
```
(18759, 50)
```
```python
test_vectors = [model.infer_vector(test_corpus[doc_id])
                for doc_id in range(len(test_corpus))]
X_val = np.vstack(test_vectors)
```
]

???
Now I can encode my training corpus; I get a 50-dimensional vector for each movie review in the training dataset. If I want to do classification, I can run a logistic regression on these vectors. model.infer_vector does the gradient descent to compute the paragraph vector for each document. In the paper, they used much more data from the same corpus, they used 400 dimensions, and they probably also iterated longer. They also didn't train a logistic regression on top; they trained a neural network, which probably makes a difference.

---
```python
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(C=100).fit(X_train, y_train_sub)
lr.score(X_train, y_train_sub)
```
```
0.817
```
```python
lr.score(X_val, y_val)
```
```
0.803
```

???
This is not working well here: either not enough training data, SGD didn't converge yet, or the dimensionality is too low. FIXME combine with word vectors? In the paper they actually get about 0.926 accuracy, using 400 dimensions (and using the unlabeled reviews as well).

---
# GloVe: Global Vectors for Word Representation

$X_{ij}$ = How often does word j appear in the context of word i

`$$ J = \sum\limits_{i,j=1}^{V} f(X_{ij}) (w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij})^2$$`

`$$ f(x) = \begin{cases} (x/x_\text{max})^\alpha & \text{if } x < x_\text{max} \\ 1 & \text{otherwise} \\ \end{cases} $$`

.smaller[
https://nlp.stanford.edu/projects/glove/
]

???
Here X_ij is the co-occurrence count matrix, counting how often words i and j appear in the same context, say a 5-word window. We are learning a factorization of log(X_ij) into w_i and ~w_j, where the w_i are the word vectors we want to extract. Very infrequent word pairs are less important to get right and are down-weighted using f(X_ij). Empirically, alpha is 3/4 and x_max is 100.

---
class: center, spacious

# GloVe Weighting function f

![:scale 90%](images/glove_weighting.png)

???
If X_ij is larger than x_max, the term is just weighted by 1. So word pairs that co-occur at least 100 times get the full loss; below that, the weight is (x/100) to the power alpha, with alpha = 0.75 working better than a linear weighting. You have to compute the big sparse co-occurrence matrix once at the beginning, but then you can do gradient descent using just these statistics.
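For reference, here is a minimal sketch (an addition, not from the slides) of the weighting function f from the slide above, using the values reported in the GloVe paper, x_max = 100 and alpha = 0.75:

```python
# Minimal sketch: the GloVe weighting function f(x) with x_max = 100
# and alpha = 0.75.
def glove_weight(x, x_max=100, alpha=0.75):
    return (x / x_max) ** alpha if x < x_max else 1.0

print(glove_weight(1))     # ~0.03, rare co-occurrences are down-weighted
print(glove_weight(50))    # ~0.59
print(glove_weight(1000))  # 1.0, capped for frequent pairs
```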
---
# Word analogies

.center[
![:scale 45%](images/word_analogies.png)
]

???
Comparison of CBOW, SVD, skip-gram and GloVe on the semantic (i.e. state→capital) and syntactic (i.e. singular→plural) word analogy tasks.

---
class: middle

# Questions ?