class: center, middle ### W4995 Applied Machine Learning # Working with Text Data 04/08/20 Andreas C. Müller ??? Today, we'll talk about working with text data. FIXME CountVectorizer in Column transformer is weird, no list, just string. Example of combining text and non-text features. FIXME explain L2 normalization and intuition? FIXME BOW figure needs repainting FIXME hashing vectorizer slide missing score output FIXME a bit too long, remove european parliament example? --- # More kinds of data - So far: * Fixed number of features * Contiguous * Categorical -- - Next up: * No pre-defined features * Free text * Images * (Audio, video, graphs, ...: not this class) ??? So far we talked about data where we have a fixed number of features, and where the features are either continuous or categorical. And that is most of the data out there and the simplest kind of data to work with. The rest of the semester, we're going to talk about different kinds of data that is opposite to what we already worked with. We'll look at data that has no predefined features. In particular, we're going to look at free text and images. We’re also going to look at time series, which is a similar category of not having a clear way to express it as a standard supervised learning task. Another common example in this category is audio and video, we're not going to do any audio or video in this lecture. But I think text and images are the common types of data. * Need to create fixed-length description --- # Typical Text Data
.center[ ![:scale 100%](images/typical_text_data_1.png) ]
.center[ ![:scale 100%](images/typical_text_data_2.png) ] ??? Here's an example of some text dataset that I'm going to use later. This is like relatively typical text, its user input data from IMDb movie reviews. Clearly, they don't have the same length, there's no clear way to express this is as like some floating point features which we would use for a standard machine learning approach. You can also see there are weird things about capitalization, some weird words in there like Godzilla that you might not see very often, there is some HTML markup, there’s punctuation and so on. This is sort of how text looks like in the wild very often. You could also do things on much longer texts, like whole books, or articles or so on. People also work on text that's much shorter. Tweets are still sort of their own domain. Working on tweets is quite hard because they are very short. I’m going to talk about a particular kind of data that you see very often because it's often used. But there are other kinds of data that is text data. --- # Other Types of text data .center[ ![:scale 100%](images/other_types_of_text_data.png) ] ??? Here is a table of the text data consisting of people sitting in the European Parliament. Each person has a country, a name, an ID, a party, and a political affiliation. The variable country here, you might hope that it's categorical but for the European Union, actually, the categories are well known. Though, if you have a user input a country, they will never input the same country in the same way. The UK could be the United Kingdom, or England, or Britain, or Great Britain, and all of these might correspond in this context to the same entity. If you have users input this, they will also make typos. This variable is representing a category concept but you have to do some cleaning before you can actually convert it into a category. This is often a manual thing to do. There are things that can help you with this, like duplication or an entity recognition but we're not going to talk about that. The names are strings. They are definitely not categorical, and they’re also not words. So it's not really clear if there’s like any information in the name and we're going to look into this later today. But you clearly don't want to treat them the same way as words that you find in a dictionary. We have the political groups and affiliations. Again, these are sort of free texts. They're sort of similar to the variable country only now there's many more of them, in a sense, they try to encode a categorical variable, which political group you're from, but they also have more information, for example, you could also treat them as texts like if there's freedom or democracy or Christianity or something in the party title that will probably say something. So you could either treat it as categorical or as text and both will give you information. And you have the same problem, that the same group can be referred with different names or different spellings. So there's this whole area of named entity recognition that actually tries to extract from any text the named entities and then tries to disambiguate. We're actually going to talk about this, but it's also some particular kind of text data that I want you to be aware of. --- # Bag of Words .center[
![:scale 85%](images/bag_of_words.png) ] ??? The most common approach that people use for this is bag of words, which is a very simple approach. We start with a string which would be your whole document. The first step is to tokenize it, which is basically breaking it up the document into words. And then, you built a vocabulary over all documents over all words. So you look over all the movie reviews, or all the emails or the text you have, and you collect all the tokens which are representing the words. Finally, using the vocabulary, you create a group representation. The length of your feature vector will be the size of the vocabulary. For each word in the vocabulary, you count how often this word appears in the string that you want to encode. Most of the words in the English language don't appear in a sentence. So most of these entries will be 0. But the word ‘ants’ appears once, so increment the count for ‘ants’ by 1 and so on. So I'll have 6 non-zero entries and all the other entries will be 0. So this is what we call like a very sparse vector so most entries of this vector are 0. When coding this, we’ll use a sparse matrix encoding. The idea of sparse matrix encoding is that you only store the non-zero entries. So basically, we will only store the 1s, so storing the string as a bag of word will cost us a nearly 6 floating point numbers. --- # Toy Example .smaller[ ```python malory = ["Do you want ants?", "Because that’s how you get ants."] ``` ```python from sklearn.feature_extraction.text import CountVectorizer vect = CountVectorizer() vect.fit(malory) print(vect.get_feature_names()) ``` ``` ['ants', 'because', 'do', 'get', 'how', 'that', 'want', 'you'] ``` ```python X = vect.transform(malory) print(X.toarray()) ``` ``` array([[1, 0, 1, 0, 0, 0, 1, 1], [1, 1, 0, 1, 1, 1, 0, 1]]) ``` ] ??? Here, I have a corpus. Corpus is what people in NLP called datasets. My dataset is 2 strings, “Do you want ants?”, “Because that's how you get ants.” The bag of word representation is implemented in the count vectorizer in scikit-learn, which is a sklearn.feature_extraction.text This is a transformer, it's kind of similar to most transformers. The only difference is it takes in lists of strings as input, not numeric data. Usually, transformers take in a Numpy area as anything in scikit-learn. But here, count vectorizer takes in a list of strings. I gave the list of string and I fit it on my documents and this builds the vocabulary. I can get back to vocabulary by looking at the feature names using get_feature_names() The vocabulary is sorted alphabetically. Vocabulary has 8 entries from ‘ants’ to ‘you’. So if I use this count vectorizer to create like a word representation, then I'll get an 8-dimensional feature space. I can use this to transform the 2 documents that I have. X here, being the outcome of the transform will be a sparse matrix. So to print it, I call toarray() on it. So sparse matrices are in scipy, and you can easily convert to and from numpy arrays. Usually, it's a very bad idea to convert to sparse matrix into Numpy array, since usually, it will not fit into your memory. If you have like a vocabulary of size 100,000 and 100,000 documents, and you make it into Numpy array, then your memory will just fill up and your computer will crash since you won’t have enough memory to store all the zeros. In this toy example, we can easily convert this to a Numpy array and you can see that this is the feature representation. In this case, we only have 1s and 0s. 
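To make the sparse encoding concrete, here is a small sketch (reusing `vect` and `malory` from the toy example above) that inspects the non-zero entries directly instead of converting to a dense array:

```python
# Look at the sparse output of the toy example without calling toarray().
X = vect.transform(malory)
print(repr(X))           # 2x8 sparse matrix, 10 stored elements, CSR format
print(X.nnz)             # number of stored (non-zero) entries: 10
print(vect.vocabulary_)  # dict mapping each token to its column index
```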
Each word appears either once or zero times. So here, in the first string, the first feature is ‘ant’, so ‘ant’, ‘do’, ‘want’ and ‘you ’ appears once, ‘because’, ‘get’, ‘how’, and ‘that’ doesn't appear. The second string is processed in the same way. Q: What would you do if your document has more unique tokens or tokens that are not in the vocabulary? Usually, you either ignore them, that’s the default thing to do. Your test set will have tokens that your training set will not have, and so you can’t really learn anything about that. One thing you could do is count how many words out of the vocabulary there are. We going to talk a little bit later about how to restrict your vocabulary and you can basically add a new feature, that says, “There's X amount of features that are not in the vocabulary.” Often, this can be useful for people cascading something or miss typing something or using names or something like that. But usually, you assume that most of the words appear in your training data set. Q: What if there are spelling errors? You can treat them but by default, if you do that, they're just different features. So they're completely unrelated. So if there's a new spelling error in your test dataset that never appeared in your training dataset, then the word will be completely ignored. You can use spell correction, I’ll talk about that in a bit. If you don't do any spell correction, it will be a completely distinct feature. Spell correction is also not entirely trivial because there are many possible words. Q: Does the count vectorizer trim the apostrophe in ‘that’s’ in the second string? Yes, and as you can see, the question marks and the dots also don't appear. And we're going to talk about how exactly it works in a little bit. Consider two documents in a dataset "malory" --- class: spacious # "bag" ```python print(malory) print(vect.inverse_transform(X)[0]) print(vect.inverse_transform(X)[1]) ``` ``` ['Do you want ants?', 'Because that’s how you get ants.'] ['ants' 'do' 'want' 'you'] ['ants' 'because' 'get' 'how' 'that' 'you'] ``` ??? The reason why this is called a bag of words is since you're completely ignoring the order of words. So you can do inverse transform to get back the string representation from the numerical representation but if you do that the numerical representation doesn't contain the order of the words so you will only get back ‘ants’ ‘do’ ‘want’ ‘you’ and ‘ants’ ‘because’ ‘get’ ‘how’ ‘that’ ‘you’ Basically, you store all the words in a big bag and completely lose any context and any order of the words. That's why it's called a bag of words. --- class: center, middle # Text classification example: # IMDB Movie Reviews ??? We're going to do a binary classification task on a dataset of IMDb movie reviews. You can look on GitHub, I have the notebook that runs through all of this if you want to run it yourself. The idea here is that we want to do sentiment analysis basically classifying reviews into either being positive or negative. In IMDb, they give stars from 0 to 10. This is from a research paper where the author stated 1, 2 and 3 as negative while 7, 8, 9 and 10 are positive. --- # Data loading ```python from sklearn.datasets import load_files reviews_train = load_files("../data/aclImdb/train/") text_trainval, y_trainval = reviews_train.data, reviews_train.target print("type of text_train: ", type(text_trainval)) print("length of text_train: ", len(text_trainval)) print("class balance: ", np.bincount(y_trainval)) ``` ``` type of text_trainval:
<class 'list'>
length of text_trainval: 25000 class balance: [12500 12500] ``` ??? Text data is in either it's in a CSV format or if they’re longer documents, people in NLP like to have a single text file for each data point. There's a tool in scikit-learn that allows you to load data, method load_files function, which basically iterates over all the folders in a given folder. Each folder is supposed to correspond to a class and then inside each folder, each text document corresponds to a data point, that's a very common format that people use for text data. We can load this, then we get the actual data and the targets. Text_trainval is used as a training and validation set is just a list. The length is 25,000. So there are 25,000 documents, and this is a balanced dataset meaning there are 12,500 positive and negative samples. --- # Data loading ```python print("text_train[1]:") print(text_trainval[1].decode()) ``` .smaller[ ``` text_train[1]: 'Words can't describe how bad this movie is. I can't explain it by writing only. You have too see it for yourself to get at grip of how horrible a movie really can be. Not that I recommend you to do that. There are so many clichés, mistakes (and all other negative things you can imagine) here that will just make you cry. To start with the technical first, there are a LOT of mistakes regarding the airplane. I won't list them here, but just mention the coloring of the plane. They didn't even manage to show an airliner in the colors of a fictional airline, but instead used a 747 painted in the original Boeing livery. Very bad. The plot is stupid and has been done many times before, only much, much better. There are so many ridiculous moments here that i lost count of it really early. Also, I was on the bad guys' side all the time in the movie, because the good guys were so stupid. "Executive Decision" should without a doubt be you're choice over this one, even the "Turbulence"-movies are better. In fact, every other movie in the world is better than this one.' ``` ] ??? We can also look at data points. I’m calling decode here to print the Unicode characters. The word ‘cliché’ can be typed with or without the axon and they would be two different words. But also, you need to determine if it’s a Unicode character or not. In Python 3, by default, everything is a unique code. --- # Vectorization ```python text_train_val = [doc.replace(b"
", b" ") for doc in text_train_val] text_train, text_val, y_train, y_val = train_test_split( text_trainval, y_trainval, stratify=y_trainval, random_state=0) vect = CountVectorizer() X_train = vect.fit_transform(text_train) X_val = vect.transform(text_val) X_train ``` ``` <18750x66651 sparse matrix of type '
' with 2580448 stored elements in Compressed Sparse Row format> ``` ??? I removed all the irrelevant HTML formatting. And then I split it into training and test set. Then I called the count vectorizer. Then I fitted the training set and transform the validation set. Then it returns a 18750x66651 sparse matrix meaning there are 18,750 samples and 66,651 features. So the vocabulary that we built is 66,651. There’s 2.5 million stored elements (non-zero entries). This is a much, much smaller number than the product of these two numbers. So most of the entries are zero. Remember, you need to know whether you are in Python 2 or 3 and then you need to know the type of text. There are two different things that you need to keep in mind. So the text can be a byte string or Unicode string. But if it's a byte string, it also has an encoding attached with it. You need to know the encoding to go from the byte string to the string. --- # Vocabulary ```python feature_names = vect.get_feature_names() print(feature_names[:10]) ``` .smaller[ ``` ['00', '000', '0000000000001', '00001', '00015', '000s', '001', '003830', '006', '007'] ``` ] ```python print(feature_names[20000:20020]) ``` .smaller[ ``` ['eschews', 'escort', 'escorted', 'escorting', 'escorts', 'escpecially', 'escreve', 'escrow', 'esculator', 'ese', 'eser', 'esha', 'eshaan', 'eshley', 'esk', 'eskimo', 'eskimos', 'esmerelda', 'esmond', 'esophagus'] ``` ] ```python print(feature_names[::2000]) ``` .smaller[ ``` ['00', 'ahoy', 'aspects', 'belting', 'bridegroom', 'cements', 'commas', 'crowds', 'detlef', 'druids', 'eschews', 'finishing', 'gathering', 'gunrunner', 'homesickness', 'inhumanities', 'kabbalism', 'leech', 'makes', 'miki', 'nas', 'organ', 'pesci', 'principally', 'rebours', 'robotnik', 'sculptural', 'skinkons', 'stardom', 'syncer', 'tools', 'unflagging', 'waaaay', 'yanks'] ``` ] ??? This is the first thing I do, it’s helpful to see if something sensible happened. And you get an idea of the data. I'm going to plot the first 10 data points, and I'm going to plot some 20 in the middle and I'm going to plot every 2000th data point. The first couple of ones seem to be pretty boring. But then it seems to do something relatively reasonable. Now that we have the sparse matrix representation, we can just do classification as usual. --- # Classification
```python
from sklearn.linear_model import LogisticRegressionCV
lr = LogisticRegressionCV().fit(X_train, y_train)
```

```python
lr.C_
```
```
array([ 0.046])
```

```python
lr.score(X_val, y_val)
```
```
0.882
```

???
All the models in scikit-learn work directly on the sparse matrix representation. Here, I use LogisticRegressionCV so that it adjusts C automatically for me. The validation set score I get is 88% accuracy, which is pretty good. Accuracy is meaningful here because we know it's a balanced dataset.

The next thing I always like to do is look at the coefficients.

---
class: middle

.center[
![:scale 100%](images/coefficients.png)
]

???
Here I'm looking at the 20 most negative and the 20 most positive coefficients of this logistic regression. That's easy to do because it's a binary task; with a multiclass task you have many more coefficients and it's harder to visualize. This seems to be a pretty reasonable model that picks up on the bad words and the good words. This is the baseline approach.

---
class: spacious

# Soo many options!

- How to tokenize?
- How to normalize words?
- What to include in vocabulary?

???
There are many different options to change what's happening here. In particular, there's tokenization, which is how you break the text up into tokens; normalization, which is how you bring the tokens into a standard form; and finally what to include in the vocabulary.

---
class: spacious

# Tokenization

- Scikit-learn (very simplistic):
  * `re.findall(r"\b\w\w+\b")`
  * Includes numbers
  * discards single-letter words
  * `-` or `'` break up words

???
Scikit-learn is not an NLP library. Some very good Python NLP libraries are NLTK and spaCy; they have a lot more tools for text analysis. Scikit-learn only has quite simple things.

For tokenization, the count vectorizer applies the regular expression `\b\w\w+\b`: it finds any sequence of two or more word characters (`\w`, which matches letters, digits, and underscore) delimited by word boundaries, i.e. punctuation or whitespace. As we saw, this includes numbers; a number can be at the beginning of a token, in the middle, or a token can be all digits. It does not include single-letter words or single-digit numbers, so 'I' is always discarded. (There's an interesting result that you can predict the gender of a writer from how often they use 'I'.) It doesn't include any punctuation, because punctuation is not matched by `\w`. This is pretty simple and pretty restrictive.

---

# Changing the token pattern regex

```python
vect = CountVectorizer(token_pattern=r"\b\w+\b")
vect.fit(malory)
print(vect.get_feature_names())
```
```
['ants', 'because', 'do', 'get', 'how', 's', 'that', 'want', 'you']
```

--

```python
vect = CountVectorizer(token_pattern=r"\b\w[\w’]+\b")
vect.fit(malory)
print(vect.get_feature_names())
```
```
['ants', 'because', 'do', 'get', 'how', 'that’s', 'want', 'you']
```

???
If we want different behavior, we can do something simple like changing the regular expression. Here, for example, I removed one of the `\w`s, so now I match anything that's a single letter or longer, and the 's' from 'that's' shows up as its own token, without the apostrophe. I can also allow an apostrophe in the middle of a word and get 'that's' as a single token. This might look nicer, but in reality it usually doesn't matter that much. Still, you can play around with this and check whether the tokens you get with the standard tokenization are good for your application.
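For a quick sanity check of what the default pattern keeps and drops, you can run it directly with Python's `re` module (a small sketch; the example string is made up, and the vectorizer additionally lowercases the text before tokenizing):

```python
import re

# The default CountVectorizer token pattern: tokens of two or more word characters.
print(re.findall(r"\b\w\w+\b", "Do you want ants? I'm sure."))
# ['Do', 'you', 'want', 'ants', 'sure'] -- 'I' and 'm' are too short, punctuation is gone
```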
Again, spaCy and NLTK have much more sophisticated tokenizers.

The X in the vectorization slide is the sparse matrix representation of the data, and y says whether a review is positive or negative.

The quote character in the second token pattern is not actually an ASCII apostrophe but a Unicode quote, because I copy & pasted the quote.

---

# Normalization

.smaller[
- Correct spelling?
- Stemming: reduce to word stem
- Lemmatization: smartly reduce to word stem
]

--

.smaller[
"Our meeting today was worse than yesterday,
I'm scared of meeting the clients tomorrow." Stemming:
` ['our', 'meet', 'today', 'wa', 'wors', 'than', 'yesterday', ',', 'i', "'m", 'scare', 'of', 'meet', 'the', 'client', 'tomorrow', '.'] ` Lemmatization:
` ['our', 'meeting', 'today', 'be', 'bad', 'than', 'yesterday', ',', 'i', 'be', 'scar', 'of', 'meet', 'the', 'client', 'tomorrow', '.'] ` ] -- .smaller[ - scikit-learn: * Lower-case it * Configurable, use nltk or spacy ] ??? Normalization is basically how do you want to bring your token in a standardized form. For example, you could get rid of plurals or you could correct spelling. There are three main methods. Correct spelling, which only very few people use in practice. Stemming reduces words to their word stem. The idea is to get rid of plural ‘s’ and ‘ing’ at the end of the verbs and so on so that you don't have the conjugations. Stemming removes some letters at the end of the word depending on a fixed rule set. Lemmatization also uses reduces the word stem but in a smart way. It tries to parse the sentence and then uses a dictionary of all English words with all its verb forms and conjugations and tries to map it to a standardized form. The reason why I picked this example is ‘meeting’ appears twice in here, once as a noun and once as a verb. Using stemming: ‘was’ becomes ‘wa’ ‘worse’ becomes ‘wors’ and so on ‘I’m’ is just split up in two without the apostrophe. ‘scared’ becomes ‘scare’ ‘meeting’ becomes ‘meet’ Lemmatization parsed the sentence and figured out that the first ‘meeting’ is a noun and keeps it as it is but the second ‘meeting’ which is a verb is simplified to ‘meet’. You can also see that the ‘was’ was normalized to ‘be’ and the ‘worse’ was normalized to ‘bad’ and the ’m was normalized to ‘be’ as well. The question with any kind of normalization you do is will this be helpful for your application. Q: is stemming only applicable to English literature? These are both language dependent. This model exists for many languages. Today, I’m only talking about languages that work similar to English. Scikit-learn only lowercases it. You can plug in other normalization from NLTK or spacy into the count vectorizer if you want, but the default is it lower cases the tokens. --- class: center, middle # Restricting the Vocabulary ??? The other main thing that you can play with is restricting the vocabulary. So far, I said, we're just going to use all the tokens that we see. 
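If you do want more normalization than lowercasing, you can plug NLTK or spaCy into the count vectorizer before restricting the vocabulary. Here is a minimal sketch using spaCy's lemmatizer as a custom tokenizer; it assumes the small English model is installed (`python -m spacy download en_core_web_sm`), and the helper name `lemma_tokenizer` is just for illustration:

```python
import spacy
from sklearn.feature_extraction.text import CountVectorizer

nlp = spacy.load("en_core_web_sm")

def lemma_tokenizer(doc):
    # keep alphabetic tokens and replace each one by its lemma
    return [token.lemma_ for token in nlp(doc) if token.is_alpha]

vect = CountVectorizer(tokenizer=lemma_tokenizer)
vect.fit(["Our meeting today was worse than yesterday, "
          "I'm scared of meeting the clients tomorrow."])
print(vect.get_feature_names())
```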
--- # Stop Words ```python vect = CountVectorizer(stop_words='english') vect.fit(malory) print(vect.get_feature_names()) ``` ``` ['ants', 'want'] ``` ```python from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS print(list(ENGLISH_STOP_WORDS)) ``` .tiny-code[ ``` ['former', 'above', 'inc', 'off', 'on', 'those', 'not', 'fifteen', 'sometimes', 'too', 'is', 'move', 'much', 'own', 'until', 'wherein', 'which', 'over', 'thru', 'whoever', 'this', 'indeed', 'same', 'three', 'whatever', 'us', 'somewhere', 'after', 'eleven', 'most', 'de', 'full', 'into', 'being', 'yourselves', 'neither', 'he', 'onto', 'seems', 'who', 'between', 'few', 'couldnt', 'i', 'found', 'nobody', 'hereafter', 'therein', 'together', 'con', 'ours', 'an', 'anyone', 'became', 'mine', 'myself', 'before', 'call', 'already', 'nothing', 'top', 'further', 'thereby', 'why', 'here', 'next', 'these', 'ever', 'whereby', 'cannot', 'anyhow', 'thereupon', 'somehow', 'all', 'out', 'ltd', 'latterly', 'although', 'beforehand', 'hundred', 'else', 'per', 'if', 'afterwards', 'any', 'since', 'nor', 'thereafter', 'it', 'around', 'them', 'alone', 'up', 'sometime', 'very', 'give', 'elsewhere', 'always', 'cant', 'due', 'forty', 'still', 'either', 'was', 'beyond', 'fill', 'hereupon', 'no', 'might', 'by', 'everyone', 'five', 'often', 'several', 'and', 'something', 'formerly', 'she', 'him', 'become', 'get', 'could', 'ten', 'below', 'had', 'how', 'back', 'nevertheless', 'namely', 'herself', 'none', 'be', 'himself', 'becomes', 'hereby', 'never', 'along', 'while', 'side', 'amoungst', 'toward', 'made', 'their', 'part', 'everything', 'his', 'becoming', 'a', 'now', 'am', 'perhaps', 'moreover', 'seeming', 'themselves', 'name', 'etc', 'more', 'another', 'whither', 'see', 'herein', 'whom', 'among', 'un', 'via', 'every', 'cry', 'me', 'should', 'its', 'again', 'co', 'itself', 'two', 'yourself', 'seemed', 'under', 'then', 'meanwhile', 'anywhere', 'beside', 'seem', 'please', 'behind', 'sixty', 'were', 'in', 'upon', 'than', 'twelve', 'when', 'third', 'to', 'though', 'hence', 'done', 'other', 'where', 'someone', 'of', 'whose', 'during', 'many', 'as', 'except', 'besides', 'for', 'within', 'mostly', 'but', 'nowhere', 'we', 'our', 'through', 'both', 'bill', 'yours', 'less', 'well', 'have', 'therefore', 'one', 'last', 'throughout', 'can', 'mill', 'against', 'anyway', 'at', 'system', 'noone', 'that', 'would', 'only', 'rather', 'wherever', 'least', 'are', 'empty', 'almost', 'latter', 'front', 'my', 'amount', 'put', 'what', 'whereas', 'across', 'whereupon', 'otherwise', 'thin', 'others', 'go', 'thus', 'enough', 'her', 'fire', 'may', 'once', 'show', 'because', 'ourselves', 'some', 'such', 'yet', 'eight', 'sincere', 'from', 'been', 'twenty', 'whether', 'without', 'you', 'do', 'everywhere', 'six', 'however', 'first', 'find', 'hers', 'towards', 'will', 'also', 'even', 'or', 're', 'describe', 'serious', 'so', 'anything', 'must', 'ie', 'the', 'whenever', 'thick', 'bottom', 'they', 'keep', 'your', 'has', 'about', 'each', 'four', 'eg', 'interest', 'hasnt', 'detail', 'amongst', 'take', 'thence', 'down', 'fifty', 'whence', 'whereafter', 'nine', 'with', 'whole', 'there'] ``` ] ??? One way to restrict the vocabulary often is to remove stop words. Stop words, are words that are not so important to the content and so you discard them. If I do this for my example, the only thing that remains are ‘ants’ and ‘want’ because ‘this’, ‘I’ and ‘how’ are all stop words. Scikit-learn has some stop words built in. 
So if you do stop word equal to English, it will use the built-in the stop word list. It's a little bit strange and we're working on improving it. So for example, there’s ‘system’ in the stop word list and ‘bill’ is in the stop word list. For unsupervised problems, using stop words might be helpful. For supervised problems, I rarely found it helpful with the tools that we talked about. So problem is, these are nearly 200 words but we have 66,000 features, removing 200 features doesn't really make a difference. And because these words appear very often, the model won’t be able to figure out are they important or not. If you use an unsupervised model, in the unsupervised model since they appear so often, these might just dominate whatever clustering you do and so removing them might be a good idea. - not a very good stop-word list? Why is system in it? bill? - For supervised learning often little effect on large corpuses (on small corpuses and for unsupervised learning it can help) --- # Infrequent Words - Remove words that appear in less than 2 documents: .smaller[ ```python vect = CountVectorizer(min_df=2) vect.fit(malory) print(vect.get_feature_names()) ``` ``` ['ants', 'you'] ``` ] - Restrict vocabulary size to max_features most frequent words: .smaller[ ```python vect = CountVectorizer(max_features=4) vect.fit(malory) print(vect.get_feature_names()) ``` ``` ['ants', 'because', 'do', 'you'] ``` ] ??? Another thing that is often helpful is removing infrequent words. There's a parameter called min_df that basically removes words that appear less than twice. It'll only keep words that appeared twice or more, and so only ‘ants’ and ‘you’ remain in the toy dataset example. This is often useful because it can remove a lot of features. And if a word appears only once in your data, it's unlikely that your algorithm will be able to learn. If you have hundreds of thousands of thousands of features, and one feature appears once in your dataset, then it’s probably not going to be helpful. You can also go the other way around and set max_features. Max_features sets the number of features that you want to keep, in this case, four most common features. It's kind of interesting, this goes a little bit in the opposite direction of what the stop words do. The stop words remove the most common ones that are meaningless and here, we removed the most infrequent ones because they're probably not helpful. These are often misspellings, or names or something like that. --- .smaller[ ```python vect = CountVectorizer(min_df=2) X_train_df2 = vect.fit_transform(text_train) vect = CountVectorizer(min_df=4) X_train_df4 = vect.fit_transform(text_train) X_val_df4 = vect.transform(text_val) print(X_train.shape) print(X_train_df2.shape) print(X_train_df4.shape) ``` ``` (18750, 66651) (18750, 39825) (18750, 26928) ``` ```python lr = LogisticRegressionCV().fit(X_train_df4, y_train) lr.C_ ``` ``` array([ 0.046]) ``` ```python lr.score(X_val_df4, y_val) ``` ``` 0.881 ``` ] ??? Here I used the count vectorizer with min_df=2 and min_df=4. That means either use tokens that appear in 2 documents in a training set, or at least 4 documents in a training set. Min_df=2 gives me 39825 while min_df=4 gives me 26928. I’ve drastically reduced the feature space when using min_df=2 and if I set min_df=4, I cut down feature space even more. Min_df means minimum document frequency, meaning the number of documents the words appear in. It's not the number of times the word appears. 
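A tiny sketch to make that distinction concrete (the two-document corpus here is made up): a token that occurs many times, but only within one document, is still dropped by `min_df=2`:

```python
docs = ["ants ants ants ants you", "do you want cheese"]
vect = CountVectorizer(min_df=2)
print(vect.fit(docs).get_feature_names())
# ['you'] -- 'ants' occurs four times, but only in one document, so it is dropped
```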
So even if a word appears 100 times in a single document, it's still going to be thrown out by this. Here, I used X_train_df4, which is much, much smaller (less than half the size), and the result I get is basically identical in terms of accuracy. These are the main configuration options if you just use a single bag of words. That said, a plain bag of words throws away a lot of context and information. One way to improve on this is what's called n-grams.

- Removed nearly 1/3 of features!
- As good as before

---

# Tf-idf rescaling

$$ \text{tf-idf}(t,d) = \text{tf}(t,d)\cdot \text{idf}(t)$$

$$ \text{idf}(t) = \log\frac{1+n_d}{1+\text{df}(d,t)} + 1$$

$n_d$ = total number of documents
$df(d,t)$ = number of documents containing term $t$ -- * In sklearn: by default also L2 normalisation! ??? Next thing we can do is actually try to rescale our words with something called Tf-idf which stands for Term Frequency Inverse Document Frequency. This is sort of an alternative to using soft stop words in a sentence. The idea here is to down-weight things that are very common. It’s the logarithm of 1 over the number of documents containing the term. So if something happens to appear in most of the documents, the inverse document frequency will give it a very low weight. This is something that's very commonly used in information retrieval. Information retrieval basically means you're trying to find relevant documents. If you can think of how a search engine works, if you want to match a bag of words representation of two different query strings, you might not care so much about matching the ‘a’ and ‘the’, but you care a lot about matching the parts that are rare and that are very specific to a particular search term. That is the motivation for this. But people find it also sometimes helps in a machine learning context. You can’t really know in advance whether this helps for the particular dataset or not, but basically you should keep in mind, this emphasizes rare words and does a soft removal of very common words. This also means the sort of stop words here are corpus specific so here it will learn to down-weight ‘movie’ and ‘film’ because they're very common. The implementation of this in scikit-learn is a little bit non-standard but I don't think it makes a big difference. You should also keep in mind is that if you use the implementation of TF-IDF in scikit-learn by default, it will also do L2 normalization. L2 normalization means you divide each row by its length. That means you normalize the weight the length of the document. You only want to count how often the word appears relative to how long the document is. Basically, if you make a document that basically repeats each word twice, it should still have the same representation. By default, that's turned on for TF-IDF. But it's not turned on for the bag of words. * Emphasizes "rare" words - "soft stop word removal" * Slightly non-standard smoothing (many +1s) --- # TfidfVectorizer, TfidfTransformer .smaller[ ```python from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer ``` ```python malory_tfidf = TfidfVectorizer().fit_transform(malory) malory_tfidf.toarray() ``` ``` array([[ 0.41 , 0. , 0.576, 0. , 0. , 0. , 0.576, 0.41 ], [ 0.318, 0.447, 0. , 0.447, 0.447, 0.447, 0. , 0.318]]) ``` ```python malory_tfidf = make_pipeline(CountVectorizer(), TfidfTransformer()).fit_transform(malory) malory_tfidf.toarray() ``` ``` array([[ 0.41 , 0. , 0.576, 0. , 0. , 0. , 0.576, 0.41 ], [ 0.318, 0.447, 0. , 0.447, 0.447, 0.447, 0. , 0.318]]) ``` ] ??? In scikit-learn, there are two ways to do this. You can either use the TF-IDF vectorizer which you can use directly on the text and it will build the vocabulary and then do the rescaling. Or you can use to count vectorizer, TFIDFTransformer, which will work on the sparse matrix created by the count vectorizer and then just transform it using rescaling. This is an alternative to using stop words or putting more emphasis on rare words. Q: Explain the part about the L2 norm. For the L1 norm, it might be easier to visualize. 
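Here is a short numeric check first (reusing `malory` from above): computing the tf-idf values without normalization and then dividing each row by its Euclidean length reproduces the default `TfidfVectorizer` output shown above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

raw = TfidfVectorizer(norm=None).fit_transform(malory).toarray()
# L2 normalization: divide each row by its Euclidean length
manual = raw / np.linalg.norm(raw, axis=1, keepdims=True)
print(np.round(manual, 3))  # matches the default (norm='l2') output above
```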
Dividing by L1 norm would mean just dividing by word count and so that would mean that basically, if you have a very long review, and it says bad 3 times, but it says good 20 times, then you only care about the relative frequency. And if it's a very short review that says good, 0 times, and bad also 3 times, then it's probably a bad review. So you want to really scale these so that the relative frequency of the word only matters and not the total length document. That's sort of what the normalization does. And so the common thing to do is L2 normalization, which divides by Euclidean norm. --- # N-grams: Beyond single words - Bag of words completely removes word order. - "didn't love" and "love" are very different! -- .center[ ![:scale 80%](images/single_words.png) ] ??? Because there's a big difference between ‘didn't love’ and ‘love’ and the default settings in the count vectorizer cannot distinguish between the two because somewhere along the document, it could appear ‘don't’ and you don't know if it's in front of ‘love’, or if it’s ‘love, don't hate’ or ‘don't love hate’, since they basically have the same representation. The idea behind N-grams is you look at pairs of words that appear next to each other. Unigrams looks at single words, bigrams look at pairs of two words next to each other, and trigrams looks at three words next to each other, and so on. Here instead of splitting it up into single tokens, we can split it up into pairs of tokens allowing me to have a little bit of context around each of the words. - N-grams: tuples of consecutive words --- # Bigrams toy example .tiny-code[ ```python cv = CountVectorizer(ngram_range=(1, 1)).fit(malory) print("Vocabulary size: ", len(cv.vocabulary_)) print("Vocabulary:\n", cv.get_feature_names()) ``` ``` Vocabulary size: 8 Vocabulary: ['ants', 'because', 'do', 'get', 'how', 'that', 'want', 'you'] ``` ```python cv = CountVectorizer(ngram_range=(2, 2)).fit(malory) print("Vocabulary size: ", len(cv.vocabulary_)) print("Vocabulary:\n", cv.get_feature_names()) ``` ``` Vocabulary size: 8 Vocabulary: ['because that', 'do you', 'get ants', 'how you', 'that how', 'want ants', 'you get', 'you want'] ``` ```python cv = CountVectorizer(ngram_range=(1, 2)).fit(malory) print("Vocabulary size: ", len(cv.vocabulary_)) print("Vocabulary:\n", cv.get_feature_names()) ``` ``` Vocabulary size: 16 Vocabulary: ['ants', 'because', 'because that', 'do', 'do you', 'get', 'get ants', 'how', 'how you', 'that', 'that how', 'want', 'want ants', 'you', 'you get', 'you want'] ``` ] ??? When implementing in scikit-learn, you have to specify the N-gram range, which has the minimum number of tokens and the maximum number of tokens you want to look at. Usually, you want to look at Unigram + Bigrams + Trigrams, don't look at only Bigrams. If I only look at Unigrams, I get a vocabulary of size 8. If I look only at Bigrams, I get a vocabulary of size 8 as well, that's because this was a toy example and in the real world there will be much more Bigrams than Unigrams. If you look at the result from combining Unigrams and Bigrams together, we get 16, which is a combination of both the result. Now I have a feature representation of size 16, and so I extended my feature space so that I can take the context into account. Unigrams and Bigrams together are the most common combinations. Using higher order N-grams is rare. - Typically: higher n-grams lead to blow up of feature space! 
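You can also inspect directly how a single document gets split into n-grams by asking the vectorizer for its analyzer (a small sketch reusing the toy sentence):

```python
analyzer = CountVectorizer(ngram_range=(1, 2)).build_analyzer()
print(analyzer("Do you want ants?"))
# ['do', 'you', 'want', 'ants', 'do you', 'you want', 'want ants']
```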
--- # N-grams on IMDB data ``` Vocabulary Sizes 1-gram (min_df=4): 26928 2-gram (min_df=4): 128426 1-gram & 2-gram (min_df=4): 155354 1-3gram (min_df=4): 254274 1-4gram (min_df=4): 289443 ``` ```python cv = CountVectorizer(ngram_range=(1, 4)).fit(text_train) print("Vocabulary size 1-4gram: ", len(cv.vocabulary_)) ``` ``` Vocabulary size 1-4gram (min_df=1): 7815528 ``` - More than 20x more 4-grams! ??? This is using no stop words. Now because many bigrams and many trigrams are very infrequent. So if I use the example of min_df=4 in Unigrams I get 26928 features while in Bigrams I get 128426 features. So it grew by nearly the order of magnitude. So I have a much, much bigger feature space now. If I put it together, it's even bigger. The higher I go, the bigger it gets. Though, you can see that it actually grows relatively slowly here. So if I look into 1-3grams programs, and 1-4grams, I only get 300,000 features. But this is mostly because I did the min_df=4. There are very few sequences of 4 words that appear often together. If I don't use the min_df=4 and I look at all things I get nearly 8 million features. This is like 20 times more four grams than what we had before. So particularly if you use higher N-grams using this pruning of document frequency might be quite helpful. You can see that probably they have about the same amount of information, but you’ll have 20 times fewer features, which will definitely impact runtime and possibly impact generalization. --- #Stop-word impact on bi-grams ```python cv = CountVectorizer(ngram_range=(1, 2), min_df=4) cv.fit(text_train) print("(1, 2), min_df=4: ", len(cv.vocabulary_)) cv = CountVectorizer(ngram_range=(1, 2), min_df=4, stop_words="english") cv.fit(text_train) print("(1, 2), stopwords, min_df=4: ", len(cv.vocabulary_)) ``` ``` (1, 2), min_df=4: 155354 (1, 2), stopwords, min_df=4: 81085 ``` ??? This is with stop words. What we're doing is we do stop words first and then N-grams. What I find interesting is that here using unigrams and bigrams we got 155354, but if we remove the stop words it gets halved to 81085. This is pretty extreme. The stop word list is only 200 words. So if we remove these 200 words and if you look at the bigrams we half the number of bigrams. This is because the most common combinations are actually we with stop words. So ‘the movie’ and ‘the film’ are probably the most common Bigrams in this. So if we remove this then whatever word that came before ‘the’ is now makes after bigram. And so now, the combinations appear much less frequently. --- # Stop-word impact on 4-grams ```python cv4 = CountVectorizer(ngram_range=(4, 4), min_df=4) cv4.fit(text_train) cv4sw = CountVectorizer(ngram_range=(4, 4), min_df=4, stop_words="english") cv4sw.fit(text_train) print(len(cv4.get_feature_names())) print(len(cv4sw.get_feature_names()))``` ``` 31585 369 ``` ??? 4grams are even rarer and so there are 31585 4grams that appear at least 4 times but if I remove the stop words I get only 369 4grams that appeared at least 4 times. 
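The list of frequent 4-grams on the next slide can be reproduced, roughly, by summing the columns of the count matrix and sorting (a sketch using `cv4sw` and `text_train` from above):

```python
import numpy as np

X4 = cv4sw.transform(text_train)
counts = np.asarray(X4.sum(axis=0)).ravel()     # total count of each 4-gram
features = np.array(cv4sw.get_feature_names())
print(features[np.argsort(counts)[::-1][:50]])  # the 50 most frequent 4-grams
```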
--- ``` ['worst movie ve seen' '40 year old virgin' 've seen long time' 'worst movies ve seen' 'don waste time money' 'mystery science theater 3000' 'worst film ve seen' 'lose friends alienate people' 'best movies ve seen' 'don waste time watching' 'jean claude van damme' 'really wanted like movie' 'best movie ve seen' 'rock roll high school' 'don think ve seen' 'let face music dance' 'don say didn warn' 'worst films ve seen' 'fred astaire ginger rogers' 'ha ha ha ha' 'la maman et la' 'maman et la putain' 'left cutting room floor' 've seen ve seen' 'just doesn make sense' 'robert blake scott wilson' 'late 70 early 80' 'crouching tiger hidden dragon' 'low budget sci fi' 'movie ve seen long' 'toronto international film festival' 'night evelyn came grave' 'good guys bad guys' 'low budget horror movies' 'waste time watching movie' 'vote seven title brazil' 'bad bad bad bad' 'morning sunday night monday' '14 year old girl' 'film based true story' 'don make em like' 'silent night deadly night' 'rating saturday night friday' 'right place right time' 'friday night friday morning' 'night friday night friday' 'friday morning sunday night' 'don waste time movie' 'saturday night friday night' 'really wanted like film'] ``` ??? This is the most common 4grams after removing stop words. Whenever you have the ability to look into something interesting in our data, do it. --- .center[ ![:scale 75%](images/stopwords_1.png) ] .center[ ![:scale 75%](images/stopwords_2.png) ] ??? Here are some models using Unigrams, Bigrams, and Trigrams, with and without stop words. You can see that most of the most common features are actually Unigrams. It's interesting because previously ‘worth’ didn't show up but since we have bigrams, we can distinguish between ‘not worth’ and ‘well worth’. You can also see that most of these actually were Bigrams that included stop word. It gets slightly worse because all of these Bigrams actually have stop words in them. Stopwords removed fares slightly worse --- .tiny-code[ ```python my_stopwords = set(ENGLISH_STOP_WORDS) my_stopwords.remove("well") my_stopwords.remove("not") my_stopwords.add("ve") ``` ```python vect3msw = CountVectorizer(ngram_range=(1, 3), min_df=4, stop_words=my_stopwords) X_train3msw = vect3msw.fit_transform(text_train) lr3msw = LogisticRegressionCV().fit(X_train3msw, y_train) X_val3msw = vect3msw.transform(text_val) lr3msw.score(X_val3msw, y_val) ``` ``` 0.883 ``` ] .center[ ![:scale 90%](images/stopwords_3.png) ] ??? Basically, removing stop words is fine but you should keep in the ‘not’ and the ‘well’ so that the important things can still be expressed. --- class: middle # Character n-grams ??? The next thing I want to talk about is character n-grams. Character n-grams are a different way to extract features from text data. It's not that useful for general text classification but it's useful for some more specific applications. The idea is that, instead of looking at tokens, which are words, we look at windows of characters. --- #Principle
.center[ ![:scale 100%](images/char_ngram_1.png) ] --- #Principle
.center[ ![:scale 100%](images/char_ngram_2.png) ] --- #Principle
.center[ ![:scale 100%](images/char_ngram_3.png) ] --- #Principle
.center[ ![:scale 100%](images/char_ngram_4.png) ] --- #Principle
.center[ ![:scale 100%](images/char_ngram_5.png) ] ??? Here, I want to look at character trigrams, look at windows of length 3. My first one would be “Do_” and then “o_y” and then “_you” and so… Then you build the vocabulary over these trigrams. --- class: spacious #Applications - Be robust to misspelling / obfuscation - Language detection - Learn from Names / made-up words ??? This can be helpful to be more robust towards misspelling or obfuscation. So people might replace a single particular letter in a word with like internet speak. And so if you use character n-grams, you can still match if someone used the same characters for the rest of the word. You can also use it for language detection. Language detection, I think by now relatively soft task. But basically, it's very easy to detect the language, if you have a text, you can use character n-grams. You can't really use a bag of words because the words in different languages are sort of distinct, and you don't actually need to build vocabularies for all the languages. If you just look at the n-grams, that's enough to know whether something is English or French or something. Also, it helps to learn from names or any made-up words. One thing that is interesting in a social context is to look at ethnicity from names. And you can actually get someone's ethnicity from their last name, and in some cases, also from their first name. --- # Toy example - "Naive" .tiny-code[ ```python cv = CountVectorizer(ngram_range=(2, 3), analyzer="char").fit(malory) print("Vocabulary size: ", len(cv.vocabulary_)) print("Vocabulary:\n", cv.get_feature_names()) ``` ``` Vocabulary size: 73 Vocabulary: [' a', ' an', ' g', ' ge', ' h', ' ho', ' t', ' th', ' w', ' wa', ' y', ' yo', 'an', 'ant', 'at', 'at’', 'au', 'aus', 'be', 'bec', 'ca', 'cau', 'do', 'do ', 'e ', 'e t', 'ec', 'eca', 'et', 'et ', 'ge', 'get', 'ha', 'hat', 'ho', 'how', 'nt', 'nt ', 'nts', 'o ', 'o y', 'ou', 'ou ', 'ow', 'ow ', 's ', 's h', 's.', 's?', 'se', 'se ', 't ', 't a', 'th', 'tha', 'ts', 'ts.', 'ts?', 't’', 't’s', 'u ', 'u g', 'u w', 'us', 'use', 'w ', 'w y', 'wa', 'wan', 'yo', 'you', '’s', '’s '] ``` ] - Respect word boundaries .tiny-code[ ```python cv = CountVectorizer(ngram_range=(2, 3), analyzer="char_wb").fit(malory) print("Vocabulary size:", len(cv.vocabulary_)) print("Vocabulary:\n", cv.get_feature_names()) ``` ``` Vocabulary size: 74 Vocabulary: [' a', ' an', ' b', ' be', ' d', ' do', ' g', ' ge', ' h', ' ho', ' t', ' th', ' w', ' wa', ' y', ' yo', '. ', '? ', 'an', 'ant', 'at', ' at’', 'au', 'aus', 'be', 'bec', 'ca', 'cau', 'do', 'do ', 'e ', 'ec', 'eca', 'et', 'et ', 'ge', 'get', 'ha', 'hat', 'ho', 'how', 'nt', 'nt ', 'nts', 'o ', 'ou', 'ou ', 'ow', 'ow ', 's ', 's.', 's. ', 's?', 's? ', 'se', 'se ', 't ', 'th', 'tha', 'ts', 'ts.', 'ts?', 't’', 't’s', 'u ', 'us', 'use', 'w ', 'wa', 'wan', 'yo', 'you', '’s', '’s '] ``` ] ??? To do this with scikit-learn, you can use the count vectorizer with the analyzer equal char. So instead of word tokenization, it does character tokenization. Usually, we look at characters that are longer, so single characters don’t really tell us much. Here, I look at n-grams of size 2 and 3. You can also go up to, like 5 or 7, obviously, the feature space tends to explode if you go higher. After applying, I get a vocabulary of size 73, which is pretty big for such a short text. And there's also a featurette that can respect word boundaries called char_wb. 
So here, you get the end of one word and the beginning of the next word, you might want to exclude that if you're only interested on the actual words. So the char_wb makes sure that there's no white space inside the character n-gram. --- # IMDB Data .smaller[ ```python char_vect = CountVectorizer(ngram_range=(2, 5), min_df=4, analyzer="char_wb") X_train_char = char_vect.fit_transform(text_train) ``` ```python len(char_vect.vocabulary_) ``` ``` 164632 ``` ```python lr_char = LogisticRegressionCV().fit(X_train_char, y_train) X_val_char = char_vect.transform(text_val) lr_char.score(X_val_char, y_val) ``` ``` 0.881 ``` ] ??? If you want, you can use that for classification. Here, I use it again on the IMDb data set with 2-5grams. The vocabulary is quite a bit bigger than if I looked at single words. But actually, I get a result that is about as good as the bag of words. --- class: middle .center[ ![:scale 100%](images/imdb_char_ngrams.png) ] ??? I can also look at the features that are important. The way this looks like is that it seems pretty redundant. So it picked up different subparts of the words. So this might not be ideal, but the accuracy is still good. Another thing that I found quite interesting when I looked at this is that some people gave their star rating in the comment. This is actually leaking the label of the star rating. This is not something that we saw before but here looking at his character n-grams, we can see that that's something that's in the dataset. --- class: middle # Predicting Nationality from Name ??? A more useful application is going back to the European Parliament. What I'm going to do now is predict nationality from the name. --- .center[ ![:scale 70%](images/nationality_name_1.png) ] .center[ ![:scale 70%](images/nationality_name_2.png) ] ??? Here's the distribution of the country. This is a very imbalanced classification. --- # Comparing words vs chars .tiny-code[ ```python bow_pipe = make_pipeline(CountVectorizer(), LogisticRegressionCV()) cross_val_score(bow_pipe, text_mem_train, y_mem_train, scoring='f1_macro') ``` ``` array([ 0.231, 0.241, 0.236, 0.28 , 0.254]) ``` ```python char_pipe = make_pipeline(CountVectorizer(analyzer="char_wb"), LogisticRegressionCV()) cross_val_score(char_pipe, text_mem_train, y_mem_train, scoring='f1_macro') ``` ``` array([ 0.452, 0.459, 0.341, 0.469, 0.418]) ``` ] ??? I can try to do either with word or character n-gram. Using an f1 macro, I don’t get a good f1 score. If I use character n-grams, it's not super great but it's still quite a lot better than if we look at the words because I don't need to learn new digital names. --- # Grid-search parameters .tiny-code[ ```python from sklearn.model_selection import GridSearchCV from sklearn.linear_model import LogisticRegression from sklearn.preprocessing import Normalizer param_grid = {"logisticregression__C": [100, 10, 1, 0.1, 0.001], "countvectorizer__ngram_range": [(1, 1), (1, 2), (1, 5), (1, 7), (2, 3), (2, 5), (3, 8), (5, 5)], "countvectorizer__min_df": [1, 2, 3], "normalizer": [None, Normalizer()] } grid = GridSearchCV(make_pipeline(CountVectorizer(analyzer="char"), Normalizer(), LogisticRegression(), memory="cache_folder"), param_grid=param_grid, cv=10, scoring="f1_macro" ) ``` ```python grid.fit(text_mem_train, y_mem_train) ``` ```python grid.best_score_ ``` ``` 0.58255198397046815 ``` ```python grid.best_params_ ``` ``` {'countvectorizer__min_df': 2, 'countvectorizer__ngram_range': (1, 5), 'logisticregression__C': 10} ``` ] ??? 
We can do a grid search, though that might take a while on the bigger dataset. Things that you want to tune are a regularization of your models, I always use logistic regression CV to tune it internally. If I want to tune multiple things at once, I could use as a pipeline grid search. Here, this is for character n-grams, but still, I might want to tune the n-gram range. If I was using bag of words, I might not look at sequences of 8 words because it will take very long to compute this and it's probably not going to be useful. Here, I'm using count vectorizer, but I could still normalize using the L2 normalizer. Another thing, if you do something like this, you don't want to fit each step of the pipeline every time. So some of these steps for the count vectorizer, if you have a big dataset, it might be pretty expensive and so you don't want to refit this again, just because you're plugged in the normalizer or just because you changed the C parameter in the logistic regression. Make pipeline has a parameter called memory that allows you to cash the results. So you just have to give it a non-empty string and it'll use that as the full name to catch the results. If there's no catch eviction, at some point you need to delete it, or your hard drive will run full. Here, you can see that after tuning everything, I actually get a result that's much better than what I had before. So before I had an f1 macro of 0.45 and now I have 0.58. So min_df=2 and ngram_range: (1,5) and c=10 gives the best results. - Small dataset, makes grid-search faster! (less reliable) --- .center[ ![:scale 100%](images/grid_search_table.png) ] ??? Here, I used pivot tables and Pandas to look into the effect of the different parameters. --- class: spacious # Other features - Length of text - Number of out-of-vocabularly words - Presence / frequency of ALL CAPS - Punctuation...!? (somewhat captured by char ngrams) - Sentiment words (good vs bad) - Whatever makes sense for the task! ??? There are other kinds of features that you might want to look into. There are actual collections of positive-negative words that you can download and you can check how many positive words are in this, how many negative words are in this. So basically, the model can share information about how important these words are because a particularly positive word might appear only very infrequently in your dataset but you can still learn that positive word means good. Bag of words is a pretty good baseline. It’s a very simple approach and often gives reasonable results. --- class: middle # Large Scale Text Vectorization ??? The last thing I want to mention is how you can scale this up for really large scale stuff. Basically just a very slight modification. So for example, if you want to use training a model on the Twitter stream, which has tweets coming in all the time, this has a couple of problems. - You don't want to store all the data - Your vocabulary might shift over time. You might not be able to store the whole vocabulary, because the whole vocabulary of what everybody says on Twitter is very large. There's a trick which allows you to get away without building a vocabulary. --- .center[ ![:scale 90%](images/bag_of_words.png) ] ??? This is what we did before. We tokenize the string, split onto words, build a vocabulary and then built a sparse matrix representation. Now we can replace this vocabulary with just a hash function. --- .center[ ![:scale 80%](images/large_scale_text_vec_2.png) ] ??? 
And we can use any string hash function, that computes half of the token, then we represent this token just by its hash number. Here, I used a hash method. This just hashes to 832,412 and so this is the number of the feature. Then I can use these as indices into my sparse matrix encoding. So now I need a hash function with some upper limits, let's say 1 billion or something, and as I get a feature vector of length 1 billion. And then I can use these hashes as indices into this feature vector. So I don't need to build a vocabulary which is pretty nice. --- # Trade-offs .left-column[ Pro: - Fast - Works for streaming data - Low memory footprint ] .right-column[ Con: - Can't interpret results - Hard to debug - (collisions are not a problem for model accuracy) ] ??? Here are pros and cons listed. It's hard to interpret the results because now you have like a feature vector of size million, but you don't know what the features correspond to. Previously we had a vocabulary and we could say feature 3 corresponds to X and now we don't know what it is. You could try to store the correspondences but that would sort of remove all the benefits. So that makes it kind of hard to debug. There can be hash collisions, which make it harder to interpret. 2 tokens can hash to the same index. In practice, that's not really a problem for accuracy. But it's a problem for interpretability because you can't undo the hash function so you don't know why your model learned the thing it learned. But if you have really big text data or streaming setting, this is a quite simple and effective method. --- # Near drop-in replacement .smallest[ - Careful: Uses l2 normalization by default! ] .tiny-code[ ```python from sklearn.feature_extraction.text import HashingVectorizer hv = HashingVectorizer() X_train = hv.transform(text_train) X_val = hv.transform(text_val) ``` ```python lr.score(X_val, y_val) ``` ```python from sklearn.feature_extraction.text import HashingVectorizer hv = HashingVectorizer() X_train = hv.transform(text_train) X_val = hv.transform(text_val) ``` ```python X_train.shape ``` ``` (18750, 1048576) ``` ```python lr = LogisticRegressionCV().fit(X_train, y_train) lr.score(X_val, y_val) ``` ] ??? This is a drop-in replacement for count vectorizer. It's the hashing vectorizer. Here, you can see by default, it actually uses a hash base of about a million. The result was about the same. It can be faster, and you get away without storing the vocabulary --- # Other libraries ## [nltk](https://www.nltk.org/) - Classic, comprehensiv, slightly outdated ## [spaCy](https://spacy.io/) - recommended; modern, fast; API still changing in parts. ## [gensim](https://radimrehurek.com/gensim/) - focus on topic modeling --- class: middle # Questions ?