<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Anna Norberg</title>
	<atom:link href="http://annanorberg.net/blog/feed/" rel="self" type="application/rss+xml" />
	<link>http://annanorberg.net/blog</link>
	<description>My notebook in cyberspace.</description>
	<lastBuildDate>Thu, 12 Apr 2012 09:33:54 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Notes from NLP course: Sentiment analysis</title>
		<link>http://annanorberg.net/blog/2012/04/12/notes-from-nlp-sentiment-analysis/</link>
		<comments>http://annanorberg.net/blog/2012/04/12/notes-from-nlp-sentiment-analysis/#comments</comments>
		<pubDate>Thu, 12 Apr 2012 09:33:54 +0000</pubDate>
		<dc:creator>Anna</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[natural language processing]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[stanford]]></category>

		<guid isPermaLink="false">http://annanorberg.net/blog/?p=899</guid>
		<description><![CDATA[What is sentiment analysis? A kind of text classification. Is a movie review good or bad? Product search Twitter sentiment vs Gallup poll of consumer confidence Twitter mood predicts the stock market Target sentiment on Twitter &#8211; Twitter sentiment app Sentiment analysis has many other names -opinion extraction -opinion mining -sentiment mining -subjectivity analysis Why [...]]]></description>
			<content:encoded><![CDATA[<p>What is sentiment analysis?</p>
<p>A kind of text classification.</p>
<p>Is a movie review good or bad?<br />
Product search<br />
Twitter sentiment vs Gallup poll of consumer confidence<br />
Twitter mood predicts the stock market<br />
Target sentiment on Twitter &#8211; Twitter sentiment app</p>
<p>Sentiment analysis has many other names<br />
-opinion extraction<br />
-opinion mining<br />
-sentiment mining<br />
-subjectivity analysis</p>
<p>Why sentiment analysis?<br />
-movie: is this review positive or negative?<br />
-products: what do people think about the new iPhone?<br />
-public sentiment: how is consumer confidence? is despair increasing?<br />
-politics: what do people think about this candidate or issue?<br />
-prediction: predict election outcomes or market trends from sentiment</p>
<p>Scherer typology of affective states</p>
<p>Emotion: brief organically synchronized &#8230; evaluation of major event<br />
-angry, sad, joyful, fearful</p>
<p>Mood: diffuse non-caused low-intensity long-duration change in subjective feeling<br />
-cheerful, gloomy, irritable</p>
<p>Interpersonal stances: affective stance toward another person in a specific interaction<br />
-friendly, distant, supportive, cold</p>
<p>Attitudes: enduring, affectively colored beliefs, dispositions towards objects or persons<br />
-liking, loving, hating, valuing</p>
<p>Personality traits: stable personality dispositions and typical behaviour tendencies<br />
-nervous, anxious, reckless, morose</p>
<p>Sentiment analysis is the detection of attitudes<br />
1. Holder (source) of attitude<br />
2. Target (aspect) of attitude<br />
3. Type of attitude<br />
-from a set of types, like, love, hate, value, desire, etc.<br />
-or (more commonly) simple weighted polarity, positive, negative, neutral together with strength<br />
4. Text containing the attitude<br />
-sentence or entire document</p>
<p>Simplest task: Is the attitude of this text positive or negative?<br />
More complex: Rank the attitude of this text from 1 to 5<br />
Advanced: Detect the target, source or complex attitude types</p>
<p>A baseline algorithm</p>
<p>Sentiment classification in movie reviews<br />
Polarity detection: Is an IMDB movie review positive or negative?<br />
Data: Polarity Data 2.0 (http://www.cs.cornell.edu/people/pabo/movie-review-data)</p>
<p>Baseline algorithm (adapted from Pang and Lee)</p>
<p>Tokenization<br />
Feature extraction<br />
Classification using different classifiers (Naive Bayes, MaxEnt, SVM)</p>
<p>Sentiment tokenization issues</p>
<p>Deal with HTML and XML markup<br />
Twitter mark-ups (names, hash tags)<br />
Capitalization (preserve for words in all caps)<br />
Phone numbers, dates<br />
Emoticons</p>
<p>Useful code:<br />
Christopher Potts sentiment tokenizer<br />
Brendan O&#8217;Connor twitter tokenizer</p>
<p>Extracting features for sentiment classification</p>
<p>How to handle negation<br />
Which words to use?</p>
<p>Negation</p>
<p>Add NOT_ to every word between negation and following punctuation:<br />
didn&#8217;t like this movie, but I<br />
didn&#8217;t NOT_like NOT_this NOT_movie, but I</p>
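<p>The NOT_ rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the tokenizer used in the course: the negation word list, the punctuation set, and the function name mark_negation are all assumptions made for the example.</p>

```python
import re

# Assumed negation cues; a real system would use a longer list.
NEGATIONS = {"not", "no", "never", "cannot"}
PUNCT = re.compile(r"^[.,:;!?]$")

def mark_negation(text):
    # Split into words (keeping apostrophes) and single punctuation marks.
    tokens = re.findall(r"[\w']+|[.,:;!?]", text)
    out = []
    negating = False
    for tok in tokens:
        if PUNCT.match(tok):
            # Punctuation ends the scope of the negation.
            negating = False
            out.append(tok)
        elif negating:
            out.append("NOT_" + tok)
        else:
            out.append(tok)
            if tok.lower() in NEGATIONS or tok.lower().endswith("n't"):
                negating = True
    return out

print(mark_negation("didn't like this movie, but I"))
```

<p>Run on the example sentence, the sketch prefixes like, this and movie with NOT_ and leaves the words after the comma untouched.</p>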
<p>Binarized (Boolean feature) multinomial Naive Bayes</p>
<p>Intuition:</p>
<p>For sentiment (and probably for other text classification domains) word occurrence may matter more than word frequency.<br />
-the occurrence of the word fantastic tells us a lot<br />
-the fact that it occurs 5 times may not tell us much more<br />
Boolean multinomial Naive Bayes clips all the word counts in each document at 1.</p>
<p>Binary seems to work better than full word counts<br />
Other possibility: log(freq(w))</p>
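<p>As a small sketch of what clipping the counts at 1 means, the helper below builds either a regular or a binarized bag of words. The function name bag_of_words and the sample tokens are made up for illustration.</p>

```python
from collections import Counter

def bag_of_words(tokens, binarize=False):
    if binarize:
        # Boolean variant: every word that occurs counts exactly once.
        return Counter(set(tokens))
    return Counter(tokens)

doc = "fantastic fantastic fantastic plot , fantastic cast".split()
print(bag_of_words(doc)["fantastic"])
print(bag_of_words(doc, binarize=True)["fantastic"])
```

<p>The binarized counts would then be fed to the usual multinomial Naive Bayes estimates in place of the full counts.</p>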
<p>Cross-Validation</p>
<p>Break up data into 10 folds (equal positive and negative inside each fold?)<br />
For each fold, choose the fold as a temporary test set. Train on 9 folds, compute performance on the test fold.<br />
Report average performance of the 10 runs.</p>
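<p>The procedure above could look roughly like this in Python. It is a sketch under assumptions: train_fn and eval_fn are hypothetical callbacks supplied by the caller, and the folds are shuffled at random rather than balanced by class, so the question about equal positives and negatives per fold is not addressed here.</p>

```python
import random

def k_fold_indices(n_items, k=10, seed=0):
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th index starting at i, so fold sizes differ by at most 1.
    return [indices[i::k] for i in range(k)]

def cross_validate(items, labels, train_fn, eval_fn, k=10):
    folds = k_fold_indices(len(items), k)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on the other k-1 folds, test on fold i.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = train_fn([items[j] for j in train_idx],
                         [labels[j] for j in train_idx])
        scores.append(eval_fn(model,
                              [items[j] for j in test_idx],
                              [labels[j] for j in test_idx]))
    # Report the average performance over the k runs.
    return sum(scores) / len(scores)
```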
<p>Other issues in classification</p>
<p>MaxEnt and SVM tend to do better than Naive Bayes</p>
<p>Problems: What makes reviews hard to classify?</p>
<p>Subtlety</p>
<p>Review in Perfumes: the Guide: &#8220;If you are reading this because it is your darling fragrance, please wear it at home exclusively, and tape the windows shut.&#8221;</p>
<p>Dorothy Parker on Katharine Hepburn: &#8220;She runs the gamut of emotions from A to B.&#8221;</p>
<p>Thwarted expectations and ordering effects</p>
<p>&#8220;This film should be brilliant. It sounds like a great plot, the actors are first grade &#8230; However, it can&#8217;t hold up.&#8221;</p>
<p>&#8220;Well as usual Keanu Reeves is nothing special, but surprisingly, the very talented Laurence Fishbourne is not so good either, I was surprised.&#8221;</p>
<p>Sentiment lexicons</p>
<p>The General Inquirer (free for research use)<br />
Home page: http://www.wjh.harvard.edu/~inquirer<br />
List of categories: http://www.wjh.harvard.edu/~inquirer/homecat.htm<br />
Spreadsheet: http://www.wjh.harvard.edu/~inquirer/inquirerbasic.xls</p>
<p>LIWC (Linguistic Inquiry and Word Count)</p>
<p>http://www.liwc.net</p>
<p>MPQA Subjectivity Cues Lexicon<br />
Home page: http://www.cs.pitt.edu/mpqa/subj_lexicon.html</p>
<p>Bing Liu Opinion Lexicon</p>
<p>http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar</p>
<p>SentiWordNet<br />
Home page: http://sentiwordnet.isti.cnr.it</p>
<p>Analyzing the polarity of each word in IMDB (Christopher Potts)</p>
<p>How likely is each word to appear in each sentiment class?<br />
Count (&#8220;bad&#8221;) in 1-star, 2-star, 3-star, etc.<br />
But raw counts can&#8217;t be compared directly, so use the likelihood P(w|c) instead.<br />
To make the values comparable between words, use the scaled likelihood P(w|c)/P(w).</p>
<p>Other sentiment feature: Logical negation</p>
<p>Is logical negation (no, not, never) associated with negative sentiment?</p>
<p>Potts&#8217; experiment: Count negation in online reviews, regress against the review rating.</p>
<p>Potts 2011 results: More negation in negative sentiment.</p>
<p>Learning sentiment lexicons</p>
<p>Semi-supervised learning of lexicons</p>
<p>Use a small amount of information<br />
-a few labeled examples<br />
-a few hand-built patterns<br />
To bootstrap a lexicon</p>
<p>Hatzivassiloglou and McKeown intuition for identifying word polarity</p>
<p>Adjectives conjoined by &#8220;and&#8221; have same polarity<br />
-fair and legitimate, corrupt and brutal</p>
<p>Adjectives conjoined by &#8220;but&#8221; do not<br />
-fair but brutal</p>
<p>Hatzivassiloglou and McKeown 1997</p>
<p>Step 1<br />
Label seed set of 1336 adjectives<br />
657 positive, 679 negative</p>
<p>Step 2<br />
Expand seed set to conjoined adjectives<br />
Google &#8220;was nice and&#8221;<br />
what do we see:<br />
nice, helpful<br />
nice, classy</p>
<p>Step 3<br />
Supervised classifier assigns &#8220;polarity similarity&#8221; to each word pair, resulting in a graph</p>
<p>Step 4<br />
Clustering for partitioning the graph into two</p>
<p>Output polarity lexicon<br />
positive words/negative words</p>
<p>Turney algorithm<br />
1. Extract a phrasal lexicon from reviews.<br />
2. Learn polarity of each phrase.<br />
3. Rate a review by the average polarity of its phrases.</p>
<p>Extract two-word phrases with adjectives</p>
<p>How to measure polarity of a phrase?<br />
Positive phrases co-occur more with &#8220;excellent&#8221;.<br />
Negative phrases co-occur more with &#8220;poor&#8221;.</p>
<p>But how to measure co-occurrence?</p>
<p>Pointwise mutual information</p>
<p>Mutual information between 2 random variables X and Y</p>
<p>Pointwise mutual information: How much more do events x and y co-occur than if they were independent?<br />
PMI between two words: How much more do two words co-occur than if they were independent?</p>
<p>How to estimate pointwise mutual information</p>
<p>Query search engine (Altavista)</p>
<p>Does phrase appear more with &#8220;poor&#8221; or &#8220;excellent&#8221;?</p>
<p>Polarity(phrase)=PMI(phrase, &#8220;excellent&#8221;)-PMI(phrase, &#8220;poor&#8221;)</p>
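<p>A rough sketch of the polarity formula, assuming PMI is estimated from co-occurrence counts as log2 of P(x,y) / (P(x)P(y)). The function names and all the counts are invented for illustration; in the original setup they would come from search-engine hit counts. Zero counts are not handled.</p>

```python
import math

def pmi(count_xy, count_x, count_y, total):
    # log2 of P(x, y) / (P(x) * P(y)), all estimated as relative frequencies.
    return math.log2((count_xy / total) / ((count_x / total) * (count_y / total)))

def polarity(near_excellent, near_poor, phrase, excellent, poor, total):
    # Polarity(phrase) = PMI(phrase, "excellent") - PMI(phrase, "poor")
    return (pmi(near_excellent, phrase, excellent, total)
            - pmi(near_poor, phrase, poor, total))

# Made-up counts: the phrase co-occurs 80 times with "excellent", 20 times with "poor".
print(polarity(80, 20, 1000, 5000, 5000, 1_000_000))
```

<p>A positive value means the phrase keeps company with &#8220;excellent&#8221; more than with &#8220;poor&#8221;; a negative value means the opposite.</p>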
<p>Results of Turney algorithm<br />
-majority class baseline: 59%, Turney algorithm: 74%<br />
-phrases rather than words<br />
-learns domain-specific information</p>
<p>Using WordNet to learn polarity<br />
-create positive and negative seed-words<br />
-find synonyms and antonyms<br />
-repeat, following chains of synonyms<br />
-filter</p>
<p>Summary</p>
<p>Advantages<br />
-can be domain-specific<br />
-can be more robust (more words)</p>
<p>Intuition<br />
-start with a seed set of words<br />
-find other words that have similar polarity</p>
<p>Other sentiment tasks</p>
<p>Finding sentiment of a sentence</p>
<p>Important for finding aspects or attributes<br />
-target of sentiment</p>
<p>Ex. &#8220;The food was great but the service was awful.&#8221;<br />
positive sentiment about food, negative about service</p>
<p>Finding aspect/attribute/target of sentiment</p>
<p>Frequent phrases + rules<br />
-find all highly frequent phrases across reviews (&#8220;fish tacos&#8221;)<br />
-filter by rules like &#8220;occurs right after sentiment word&#8221;<br />
-&#8221;&#8230;great fish tacos&#8221; means fish tacos a likely aspect</p>
<p>casino: casino, buffet, pool, resort, beds<br />
children&#8217;s barber: haircut, job, experience, kids<br />
department store: selection, department, sales, shop, clothing</p>
<p>The aspect name may not be in the sentence. For restaurants/hotels, aspects are well-understood.</p>
<p>Supervised classification</p>
<p>Hand-label a small corpus of restaurant review sentences with aspect<br />
-food, decor, service, value, NONE</p>
<p>Train a classifier to assign an aspect to a sentence<br />
-&#8221;Given this sentence, is the aspect food, decor, service, value or NONE.&#8221;</p>
<p>Putting it all together: Finding sentiment for aspects</p>
<p>Reviews &gt; Text Extractor &gt; Sentences &amp; Phrases &gt; Sentiment Classifier &gt; Sentences &amp; Phrases &gt; Aspect Extractor &gt; Sentences &amp; Phrases &gt; Aggregator &gt; Final Summary</p>
<p>Baseline methods assume classes have equal frequencies</p>
<p>If not balanced (common in the real world) we can&#8217;t use accuracies as an evaluation, need to use F-scores.</p>
<p>Severe imbalance can also degrade classifier performance.</p>
<p>Two common solutions:<br />
1. Resampling in training, random undersampling.<br />
2. Cost-sensitive learning, penalize SVM more for misclassification of the rare thing.</p>
<p>How to deal with 7 stars?<br />
1. Map to binary.<br />
2. Use linear or ordinal regression or specialized models like metric labeling.</p>
<p>Summary on sentiment</p>
<p>Generally modeled as classification or regression task<br />
-predict a binary or ordinal label</p>
<p>Features:<br />
-negation is important<br />
-using all words (in Naive Bayes) works well for some tasks<br />
-finding subset of words may help in other tasks (hand-built polarity lexicons, use seeds and semi-supervised learning to induce lexicons)</p>
<p>Computational work on other affective states</p>
<p>Emotion:<br />
Detecting annoyed callers to dialogue system<br />
Detecting confused/frustrated versus confident students</p>
<p>Mood:<br />
Finding traumatized or depressed writers</p>
<p>Interpersonal stances:<br />
Detection of flirtation or friendliness in conversations</p>
<p>Personality traits:<br />
Detection of extroverts</p>
<p>Detection of friendliness</p>
<p>Friendly speakers use collaborative conversational style<br />
-laughter<br />
-less use of negative emotional words<br />
-more sympathy<br />
-more agreement<br />
-fewer hedges</p>
]]></content:encoded>
			<wfw:commentRss>http://annanorberg.net/blog/2012/04/12/notes-from-nlp-sentiment-analysis/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Notes from NLP course: Text classification, part two</title>
		<link>http://annanorberg.net/blog/2012/04/11/notes-from-nlp-text-classification-2/</link>
		<comments>http://annanorberg.net/blog/2012/04/11/notes-from-nlp-text-classification-2/#comments</comments>
		<pubDate>Wed, 11 Apr 2012 09:18:01 +0000</pubDate>
		<dc:creator>Anna</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[natural language processing]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[stanford]]></category>

		<guid isPermaLink="false">http://annanorberg.net/blog/?p=896</guid>
		<description><![CDATA[Evaluation of text classification Precision, Recall and the F measure The 2-by-2 contingency table tp, true positive fp, false positive fn, false negative tn, true negative accuracy = (tp+tn)/(tp+fp+fn+tn) How often do web pages mention shoe brands? out of a sample of 100,000 web pages it is mentioned 10 times, in 99,990 times it is [...]]]></description>
			<content:encoded><![CDATA[<p>Evaluation of text classification</p>
<p>Precision, Recall and the F measure</p>
<p>The 2-by-2 contingency table</p>
<p>tp, true positive<br />
fp, false positive<br />
fn, false negative<br />
tn, true negative</p>
<p>accuracy = (tp+tn)/(tp+fp+fn+tn)</p>
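<p>How often do web pages mention shoe brands?</p>
<p>Out of a sample of 100,000 web pages, a shoe brand is mentioned on 10 and not mentioned on the other 99,990. A classifier that always answers &#8220;not mentioned&#8221; is 99.99% accurate, yet it finds none of the 10 pages.</p>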
<p>Precision and recall</p>
<p>Precision: % of selected items that are correct<br />
Recall: % of correct items that are selected</p>
<p>Prec=tp/(tp+fp)<br />
Rec=tp/(tp+fn)=0/10=0</p>
<p>A system with zero recall is not interesting.</p>
<p>If you increase recall, precision usually drops.</p>
<p>A combined measure: F</p>
<p>A combined measure that assesses the P/R tradeoff is F measure (weighted harmonic mean).</p>
<p>The harmonic mean is a very conservative average.</p>
<p>People usually use balanced F1 measure: F=2PR/(P+R)</p>
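<p>To make the formulas concrete, here is a small sketch that computes accuracy, precision, recall and balanced F1 from the four cells of the contingency table. The function name evaluate is made up, and treating 0/0 as 0.0 is an assumption of this sketch. The first call assumes a hypothetical classifier that never flags any of the 100,000 pages.</p>

```python
def evaluate(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Convention used here: a ratio with an empty denominator counts as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, precision, recall, f1

# Shoe-brand sample: 10 relevant pages, and the classifier selects nothing.
print(evaluate(tp=0, fp=0, fn=10, tn=99990))
# A made-up, more balanced case for comparison.
print(evaluate(tp=8, fp=2, fn=2, tn=88))
```

<p>The first case shows why accuracy alone is misleading: accuracy is very high while recall and F1 are zero.</p>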
<p>Evaluation: Classic Reuters-21578 Data Set<br />
-most used data set, 21,578 docs (each 90 types, 200 tokens)<br />
-9603 training, 3299 test articles<br />
-118 categories (an article can be in more than one category, learn 118 binary category distinctions)<br />
-average document (with at least one category) has 1.24 classes<br />
-only about 10 out of 118 categories are large</p>
<p>Confusion matrix c</p>
<p>For each pair of classes &lt;c1,c2&gt; how many documents from c1 were incorrectly assigned to c2?</p>
<p>Per class evaluation measures</p>
<p>Recall: fraction of docs in class i classified correctly</p>
<p>Precision: fraction of docs assigned class i that are actually about class i</p>
<p>Accuracy: (1 &#8211; error rate) fraction of docs classified correctly</p>
<p>Micro- vs. Macro-Averaging</p>
<p>If we have more than one class, how do we combine multiple performance measures into one quantity?</p>
<p>Macro-averaging: Compute performance for each class, then average.</p>
<p>Micro-averaging: Collect decisions for all classes, compute contingency table, evaluate.</p>
<p>Micro-averaged score is dominated by score on common classes.</p>
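<p>The difference between the two averages can be sketched for precision as follows. Each class is represented by a (tp, fp) pair; the function names and the two example classes are invented for illustration.</p>

```python
def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def macro_precision(tables):
    # Compute precision per class, then average the per-class values.
    return sum(precision(tp, fp) for tp, fp in tables) / len(tables)

def micro_precision(tables):
    # Pool the decisions of all classes into one table, then evaluate once.
    total_tp = sum(tp for tp, _ in tables)
    total_fp = sum(fp for _, fp in tables)
    return precision(total_tp, total_fp)

# Made-up example: a rare class (tp=10, fp=10) and a common class (tp=90, fp=10).
tables = [(10, 10), (90, 10)]
print(macro_precision(tables))
print(micro_precision(tables))
```

<p>With these numbers the micro-average is pulled towards the common class, while the macro-average weights both classes equally.</p>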
<p>Development Test Sets and Cross-Validation</p>
<p>Training set, development test set, unseen test set</p>
<p>Unseen test set<br />
-avoid overfitting<br />
-more conservative estimate of performance</p>
<p>Cross-validation over multiple splits<br />
-handle sampling errors from different datasets</p>
<p>Practical Issues</p>
<p>The real world</p>
<p>No training data?<br />
Manually written rules<br />
-need careful crafting, time-consuming</p>
<p>Very little data?<br />
Use Naive Bayes<br />
-high-bias algorithm<br />
Get more labeled data<br />
Try semi-supervised training methods</p>
<p>A reasonable amount of data?<br />
Perfect for all the clever classifiers<br />
You can even use user-interpretable decision trees</p>
<p>A huge amount of data?<br />
Can achieve high accuracy<br />
At a cost, can be too slow<br />
So Naive Bayes can come back again</p>
<p>Real world systems generally combine:<br />
Automatic classification<br />
Manual review of uncertain/difficult/&#8221;new&#8221; cases</p>
<p>Underflow prevention: log space</p>
<p>Multiplying lots of probabilities can result in floating-point underflow.<br />
Class with highest unnormalized log probability score is still most probable.<br />
Model is now just max of sum of weights.</p>
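<p>A quick sketch of why log space helps, using made-up probabilities for a 100-word document. The function names are hypothetical. Multiplying the raw probabilities underflows to zero, while summing their logs keeps a usable score.</p>

```python
import math

def naive_product(prior, word_probs):
    score = prior
    for p in word_probs:
        score *= p
    return score

def log_score(prior, word_probs):
    # Sum of logs instead of product of probabilities.
    return math.log(prior) + sum(math.log(p) for p in word_probs)

word_probs = [1e-5] * 100  # invented per-word probabilities
print(naive_product(0.5, word_probs))  # underflows to 0.0
print(log_score(0.5, word_probs))
```

<p>Because the logarithm is monotonic, the class with the highest log score is also the class with the highest probability.</p>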
<p>How to tweak performance</p>
<p>Domain-specific features and weights: very important in real performance.<br />
Sometimes need to collapse terms.<br />
Upweighting: counting a word as if it occurred twice.</p>
]]></content:encoded>
			<wfw:commentRss>http://annanorberg.net/blog/2012/04/11/notes-from-nlp-text-classification-2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Notes from NLP course: Text classification, part one</title>
		<link>http://annanorberg.net/blog/2012/04/10/notes-from-nlp-text-classification-1/</link>
		<comments>http://annanorberg.net/blog/2012/04/10/notes-from-nlp-text-classification-1/#comments</comments>
		<pubDate>Tue, 10 Apr 2012 16:07:09 +0000</pubDate>
		<dc:creator>Anna</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[natural language processing]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[stanford]]></category>

		<guid isPermaLink="false">http://annanorberg.net/blog/?p=890</guid>
		<description><![CDATA[Text classification and Naive Bayes Recognising spam Authorship attribution Gender identification Sentiment analysis &#8211; positive or negative movie review? Scientific articles &#8211; assigning subject categories, topics or genres Language identification Text classification: definition Input a document d a fixed set of classes C = {c1, c2, &#8230;, cj} Output a predicted class c from C [...]]]></description>
			<content:encoded><![CDATA[<p>Text classification and Naive Bayes</p>
<p>Recognising spam<br />
Authorship attribution<br />
Gender identification<br />
Sentiment analysis &#8211; positive or negative movie review?<br />
Scientific articles &#8211; assigning subject categories, topics or genres<br />
Language identification</p>
<p>Text classification: definition</p>
<p>Input<br />
a document d<br />
a fixed set of classes C = {c1, c2, &#8230;, cj}</p>
<p>Output<br />
a predicted class c from C</p>
<p>Classification methods</p>
<p>Hand-coded rules</p>
<p>Rules based on combinations of words or other features<br />
-spam: black-list-address OR (&#8220;dollars&#8221; AND &#8220;have been selected&#8221;)</p>
<p>Accuracy can be high if the rules are carefully refined by an expert. But building and maintaining these rules is expensive.</p>
<p>Supervised machine learning</p>
<p>Input<br />
a document d<br />
a fixed set of classes C = {c1, c2, &#8230;, cj}<br />
a training set of m hand-labeled documents (d1,c1),&#8230;,(dm,cm)</p>
<p>Output<br />
a learned classifier y:d -&gt; c</p>
<p>Any kind of classifier:<br />
Naive Bayes<br />
Logistic regression<br />
Support-vector machines<br />
k-Nearest Neighbours<br />
&#8230;</p>
<p>No matter which classifier we use, the task is the same: take a document, extract features that represent it, and build a classifier that tells us which class the document belongs to.</p>
<p>Naive Bayes<br />
-one of the most important text classification methods<br />
-based on Bayes rule<br />
-relies on very simple representation of document &#8211; bag of words</p>
<p>The bag of words representation: using a subset of words and their counts</p>
<p>Bag of words for document classification</p>
<p>Bayes&#8217; Rule Applied to Documents and Classes</p>
<p>For a document d and a class c</p>
<p>P(c|d) = P(d|c)P(c) / P(d)</p>
<p>The probability of a class</p>
<p>How often does this class occur? We can just count the relative frequencies in a corpus. The full likelihood P(d|c), by contrast, could only be estimated directly if a very, very large number of training examples were available.</p>
<p>Multinomial Naive Bayes independence assumptions</p>
<p>P(x1,x2,&#8230;,xn|c)</p>
<p>Bag of words assumption: Assume position doesn&#8217;t matter</p>
<p>Conditional independence: Assume the feature probabilities are independent given the class c.</p>
<p>Both assumptions are wrong, but they are useful simplifications.</p>
<p>Learning the multinomial Naive Bayes model</p>
<p>First attempt: maximum likelihood estimates<br />
-simply use the frequencies in the data</p>
<p>Parameter estimation</p>
<p>fraction of times word wi appears among all words in documents of topic cj</p>
<p>create mega-document for topic j by concatenating all docs in this topic<br />
-use the frequency of w in the mega-document</p>
<p>Problem with maximum likelihood</p>
<p>What if we have seen no training documents with the word fantastic and classified in the topic positive?</p>
<p>Zero probabilities cannot be conditioned away, no matter the other evidence.</p>
<p>Solution: Laplace (add-1) smoothing for Naive Bayes</p>
<p>Steps:</p>
<p>From training corpus, extract Vocabulary</p>
<p>Calculate P(cj) terms<br />
for each cj in C do<br />
docsj &lt;&#8211; all docs with class = cj<br />
P(cj) &lt;&#8211; |docsj| / |total number of documents|</p>
<p>Calculate P(wk|cj) terms<br />
Textj &lt;&#8211; single doc containing all docsj<br />
for each word wk in Vocabulary<br />
nk &lt;&#8211; number of occurrences of wk in Textj<br />
P(wk|cj) &lt;&#8211; (nk + alpha) / (n + alpha |Vocabulary|), where n is the total number of word tokens in Textj</p>
<p>Laplace smoothing: unknown words</p>
<p>Add one extra word to the vocabulary, the &#8220;unknown word&#8221; wu</p>
<p>Naive Bayes and language modeling</p>
<p>Naive Bayes classifiers can use any sort of feature<br />
-URL, email address, dictionaries, network features</p>
<p>But if<br />
-we use only word features<br />
-we use all of the words in the text (not a subset)</p>
<p>The Naive Bayes has an important similarity to language modeling.</p>
<p>Each class = a unigram language model</p>
<p>assigning each word P(word|c)<br />
assigning each sentence P(s|c)= product of P(word|c)</p>
<p>Which class assigns the higher probability to s?<br />
-is like running two language models<br />
-pick whatever model that has highest probability</p>
<p>priors<br />
P(c)=Nc/N</p>
<p>conditional probabilities<br />
P(w|c) = (count(w,c) + 1) / (count(c) + |V|)</p>
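<p>The prior and the add-1 conditional probabilities above can be put together into a small multinomial Naive Bayes sketch. The function names train_nb and classify_nb and the toy documents are assumptions for illustration. Scores are summed in log space, and test words that never appeared in training are simply skipped, which is one possible choice among several.</p>

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    # docs: list of (tokens, label) pairs.
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for tokens, label in docs:
        word_counts[label].update(tokens)
    vocab = {w for tokens, _ in docs for w in tokens}
    return class_counts, word_counts, vocab

def classify_nb(model, tokens):
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_c in class_counts.items():
        total = sum(word_counts[label].values())  # count(c)
        score = math.log(n_c / n_docs)            # log P(c) = log(Nc / N)
        for w in tokens:
            if w in vocab:
                # log P(w|c) with add-1 smoothing
                score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data, invented for the example.
toy = [
    ("Chinese Beijing Chinese".split(), "c"),
    ("Chinese Chinese Shanghai".split(), "c"),
    ("Chinese Macao".split(), "c"),
    ("Tokyo Japan Chinese".split(), "j"),
]
model = train_nb(toy)
print(classify_nb(model, "Chinese Chinese Chinese Tokyo Japan".split()))
```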
<p>Naive Bayes in spam filtering</p>
<p>SpamAssassin features:<br />
mentions generic viagra<br />
online pharmacy<br />
subject is all capitals<br />
&#8230;</p>
<p>Summary: Naive Bayes is not so naive</p>
<p>Very fast, low storage requirements</p>
<p>Robust to irrelevant features<br />
-irrelevant features cancel each other without affecting results</p>
<p>Very good in domains with many equally important features<br />
-decision trees suffer from fragmentation in such cases &#8211; especially if little data</p>
<p>Optimal if the independence assumptions hold<br />
-if assumed independence is correct, then it is the Bayes optimal classifier for problem</p>
<p>A good dependable baseline for text classification</p>
<p>But we will see other classifiers that give better accuracy</p>
]]></content:encoded>
			<wfw:commentRss>http://annanorberg.net/blog/2012/04/10/notes-from-nlp-text-classification-1/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Notes from NLP course: Spelling correction</title>
		<link>http://annanorberg.net/blog/2012/03/28/notes-from-nlp-spelling-correction/</link>
		<comments>http://annanorberg.net/blog/2012/03/28/notes-from-nlp-spelling-correction/#comments</comments>
		<pubDate>Wed, 28 Mar 2012 09:19:29 +0000</pubDate>
		<dc:creator>Anna</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[natural language processing]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[stanford]]></category>

		<guid isPermaLink="false">http://annanorberg.net/blog/?p=850</guid>
		<description><![CDATA[Applications for spelling correction -word processing -web search -phones Spelling tasks -spelling error detection -spelling error correction: autocorrect, suggest a correction, suggest lists Types of spelling errors -non-word errors, graffe&#62;giraffe -real-word errors: typographical errors, three&#62;there, cognitive errors (homophones), piece&#62;peace, too&#62;two Rates of spelling errors 26% web queries 13% retyping, no backspace 7% words corrected retyping [...]]]></description>
			<content:encoded><![CDATA[<p>Applications for spelling correction<br />
-word processing<br />
-web search<br />
-phones</p>
<p>Spelling tasks<br />
-spelling error detection<br />
-spelling error correction: autocorrect, suggest a correction, suggest lists</p>
<p>Types of spelling errors<br />
-non-word errors, graffe&gt;giraffe<br />
-real-word errors: typographical errors, three&gt;there, cognitive errors (homophones), piece&gt;peace, too&gt;two</p>
<p>Rates of spelling errors<br />
26% web queries<br />
13% retyping, no backspace<br />
7% words corrected retyping on phone-sized organizer<br />
2% words uncorrected on organizer<br />
1-2% retyping</p>
<p>Non-word spelling errors</p>
<p>Non-word spelling error detection<br />
-any word not in a dictionary is an error<br />
-the larger the dictionary the better</p>
<p>Non-word spelling error correction<br />
-generate candidates, real words that are similar to error<br />
-choose the one which is best, shortest weighted edit distance, highest noisy channel probability</p>
<p>Real-word spelling errors</p>
<p>For each word w, generate candidate set<br />
-find candidate words with similar pronunciations<br />
-find candidate words with similar spelling<br />
-include w in candidate set</p>
<p>Choose best candidate<br />
-noisy channel<br />
-classifier</p>
<p>The Noisy Channel Model of Spelling</p>
<p>Noisy Channel Intuition</p>
<p>original word &gt; noisy channel &gt; noisy word &gt; decoder &gt; guessed word</p>
<p>Noisy Channel</p>
<p>We see an observation x of a misspelled word<br />
Find the correct word w</p>
<p>channel model/error model and language model</p>
<p>History: Noisy channel for spelling proposed around 1990, IBM and AT&amp;T Bell Labs.</p>
<p>Non-word spelling error example</p>
<p>acress</p>
<p>Candidate generation<br />
-words with similar spelling, small edit distance to error<br />
-words with similar pronunciation, small edit distance of pron. to error</p>
<p>Damerau-Levenshtein edit distance</p>
<p>Minimal edit distance between two strings, where edits are:</p>
<p>-insertion<br />
-deletion<br />
-substitution<br />
-transposition of two adjacent letters</p>
<p>Words within 1 of acress<br />
acress, actress (deletion)<br />
acress, cress (insertion)<br />
acress, caress (transposition)<br />
acress, access (substitution)<br />
acress, across (substitution)<br />
acress, acres (insertion, at either of two positions)</p>
<p>80% of errors are within edit distance 1<br />
almost all errors within edit distance 2</p>
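<p>Candidate generation at edit distance 1 can be sketched like this. The function name edits1 and the tiny dictionary are assumptions for the example; a real system would check candidates against a full word list. Note that the code starts from the typed string, so each operation is the inverse of the error label in the list above (deleting a letter from the typo undoes an insertion error, and so on).</p>

```python
import string

def edits1(word, alphabet=string.ascii_lowercase):
    # All strings one Damerau-Levenshtein edit away from word.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    substitutes = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + substitutes + inserts)

# Tiny stand-in dictionary, invented for the example.
dictionary = {"actress", "cress", "caress", "access", "across", "acres", "zebra"}
candidates = edits1("acress").intersection(dictionary)
print(sorted(candidates))
```

<p>Applying edits1 to each result again would give the candidates within edit distance 2.</p>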
<p>Also allow insertion of space or hyphen<br />
thisidea &gt; this idea<br />
inlaw &gt; in-law</p>
<p>Language Model</p>
<p>Use any of the language modeling algorithms we&#8217;ve learned<br />
Unigram, bigram, trigram<br />
Web-scale spelling correction, stupid backoff</p>
<p>Unigram prior probability<br />
counts from 404,253,213 words in Corpus of Contemporary American English (COCA)</p>
<p>Channel model probability<br />
also error model probability, edit probability</p>
<p>misspelled word x=x1,x2,x3&#8230;xm<br />
correct word w=w1,w2,w3&#8230;wn</p>
<p>P(x|w)=probability of the edit<br />
(del, sub, ins, trans)</p>
<p>Computing error probability: confusion matrix</p>
<p>del[x,y]: count(xy typed as x)<br />
ins[x,y]: count(x typed as xy)<br />
sub[x,y]: count(x typed as y)<br />
trans[x,y]: count(xy typed as yx)</p>
<p>Insertion and deletion conditioned on previous character.</p>
<p>Generating the confusion matrix<br />
-Peter Norvig&#8217;s list of errors (norvig.com/ngrams/spell-errors.txt)<br />
-Peter Norvig&#8217;s list of counts of single-edit errors</p>
<p>Channel model (formulas)</p>
<p>Using a unigram language model, makes a mistake<br />
Using a bigram language model, picks correct word</p>
<p>Evaluation</p>
<p>Some spelling error test sets<br />
-Wikipedia&#8217;s list of common English misspellings<br />
-Aspell filtered version of that list<br />
-Birkbeck spelling error corpus<br />
-Peter Norvig&#8217;s list of errors (includes Wikipedia and Birkbeck, for training or testing)</p>
<p>Real-word spelling errors<br />
25-40% of spelling errors are real words</p>
<p>Solving real-word spelling errors</p>
<p>For each word in sentence, generate candidate set<br />
-the word itself<br />
-all single-letter edits that are English words<br />
-words that are homophones</p>
<p>Choose best candidates<br />
-noisy channel model<br />
-task-specific classifier</p>
<p>Noisy channel for real-word spell correction</p>
<p>Given a sentence w1,w2,w3&#8230;wn<br />
Generate a set of candidates for each word wi<br />
Choose the sequence W that maximizes P(W)</p>
<p>two of thew &#8230;</p>
<p>two: to, tao, too, two<br />
of: off, on, of<br />
thew: threw, thaw, the, thew</p>
<p>Of all the possible sentences produced by the graph, which is the most probable according to the noisy channel model?</p>
<p>Simplification: One error per sentence<br />
-two off thew<br />
-two of the<br />
-too of thew<br />
&#8230;</p>
<p>Out of all possible sentences with one word replaced, choose the sequence W that maximizes P(W).</p>
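<p>The one-error-per-sentence simplification can be sketched as follows. The candidate sets are the ones listed above; the function names, the toy bigram counts, and the scoring function are invented stand-ins for a real language model and channel model.</p>

```python
def one_error_candidates(words, candidates):
    # The unchanged sentence is always one of the candidates.
    sentences = [list(words)]
    for i, w in enumerate(words):
        for alt in candidates.get(w, []):
            if alt != w:
                # Replace exactly one word, keep the rest as typed.
                sentences.append(words[:i] + [alt] + words[i + 1:])
    return sentences

def best_sentence(words, candidates, score):
    return max(one_error_candidates(words, candidates), key=score)

candidates = {
    "two": ["to", "tao", "too", "two"],
    "of": ["off", "on", "of"],
    "thew": ["threw", "thaw", "the", "thew"],
}

# Made-up bigram counts standing in for a real scoring model.
toy_bigram_counts = {("two", "of"): 20, ("of", "the"): 500,
                     ("to", "of"): 5, ("too", "of"): 1}

def toy_score(sentence):
    return sum(toy_bigram_counts.get(pair, 0)
               for pair in zip(sentence, sentence[1:]))

print(best_sentence(["two", "of", "thew"], candidates, toy_score))
```

<p>For a three-word sentence with these candidate sets, only nine sentences need to be scored instead of every path through the graph.</p>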
<p>Where to get the probabilities?</p>
<p>Language model: unigram, bigram, etc<br />
Channel model: same as for non-word spelling correction, plus a probability for no error, P(w|w)</p>
<p>Probability of no error</p>
<p>What is the channel probability for a correctly typed word?<br />
Obviously this depends on the application.</p>
<p>Peter Norvig&#8217;s &#8220;thew&#8221; example</p>
<p>State of the art systems</p>
<p>HCI issues in spelling<br />
-if very confident in correction &#8211; autocorrect<br />
-less confident &#8211; give the best correction<br />
-even less confident &#8211; give a correction list<br />
-unconfident &#8211; just flag as an error</p>
<p>State of the art noisy channel<br />
-we never just multiply the prior and the error model<br />
-independence assumptions &gt; probabilities not commensurate<br />
-instead: weight them, learn lambda from a development test set</p>
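<p>A small sketch of the weighting idea, assuming the usual form in which the language-model prior is raised to the power lambda. The probabilities below are invented to show that the choice of lambda can change the winner; in practice lambda would be tuned on a development set.</p>

```python
import math

def weighted_score(channel_prob, prior_prob, lam):
    """log P(x|w) + lam * log P(w)."""
    return math.log(channel_prob) + lam * math.log(prior_prob)

def pick_correction(options, lam):
    """options maps a candidate word to (channel probability, prior probability)."""
    return max(options,
               key=lambda w: weighted_score(options[w][0], options[w][1], lam))

# Hypothetical numbers for the typed word "thew"
OPTIONS = {"the": (0.0001, 0.02), "thew": (0.95, 0.0000001)}
```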
<p>Phonetic error model</p>
<p>Metaphone, used in GNU aspell: convert misspelling to metaphone pronunciation, for example<br />
-drop duplicate adjacent letters, except for c<br />
-if the word begins with kn, gn, pn, ae, wr, drop first letter</p>
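<p>The two example rules can be written out as below. This is only a sketch of those two rules, not the full Metaphone algorithm, and the function names are made up for illustration.</p>

```python
def drop_duplicate_letters(word):
    """Drop duplicate adjacent letters, except for c."""
    out = []
    for ch in word:
        if out and out[-1] == ch and ch != "c":
            continue
        out.append(ch)
    return "".join(out)

def drop_silent_first_letter(word):
    """If the word begins with kn, gn, pn, ae or wr, drop the first letter."""
    if word[:2] in ("kn", "gn", "pn", "ae", "wr"):
        return word[1:]
    return word

def apply_two_rules(word):
    return drop_silent_first_letter(drop_duplicate_letters(word.lower()))
```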
<p>Find words whose pronunciation is 1-2 edit distance from misspelling: score result list<br />
-weighted edit distance of candidate to misspelling<br />
-edit distance of candidate pronunciation to misspelling pronunciation</p>
<p>Improvements to channel model<br />
-allow richer edits<br />
-incorporate pronunciation into channel</p>
<p>Factors that could influence p(misspelling|word)<br />
-the source letter<br />
-the target letter<br />
-surrounding letters<br />
-the position in the word<br />
-nearby keys on the keyboard<br />
-homology on the keyboard<br />
-pronunciations<br />
-likely morpheme transformations</p>
<p>Classifier-based methods for real-word spelling correction</p>
<p>Instead of just channel model and language model, use many features in classifier.</p>
]]></content:encoded>
			<wfw:commentRss>http://annanorberg.net/blog/2012/03/28/notes-from-nlp-spelling-correction/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Notes from NLP course: Language modeling, part two</title>
		<link>http://annanorberg.net/blog/2012/03/26/notes-from-nlp-language-modeling-p2/</link>
		<comments>http://annanorberg.net/blog/2012/03/26/notes-from-nlp-language-modeling-p2/#comments</comments>
		<pubDate>Mon, 26 Mar 2012 09:07:22 +0000</pubDate>
		<dc:creator>Anna</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[natural language processing]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[stanford]]></category>

		<guid isPermaLink="false">http://annanorberg.net/blog/?p=852</guid>
		<description><![CDATA[Smoothing: Add-one (Laplace) smoothing The intuition of smoothing When we have sparse statistics, steal probability mass to generalize better, so zeroes go away. Add-one estimation -also called Laplace smoothing -pretend we saw each word one more time than we did -just add one to all the counts MLE estimate Add-1 estimate Reminder: Maximum Likelihood Estimate [...]]]></description>
			<content:encoded><![CDATA[<p>Smoothing: Add-one (Laplace) smoothing</p>
<p>The intuition of smoothing</p>
<p>When we have sparse statistics, steal probability mass to generalize better, so zeroes go away.</p>
<p>Add-one estimation<br />
-also called Laplace smoothing<br />
-pretend we saw each word one more time than we did<br />
-just add one to all the counts</p>
<p>MLE estimate<br />
Add-1 estimate</p>
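<p>The two estimates named above, written out for bigrams as a sketch. The toy corpus is an assumption; V is taken to be the number of word types in it.</p>

```python
from collections import Counter

def train(tokens):
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def p_mle(prev, word, unigrams, bigrams):
    """count(prev, word) / count(prev)"""
    return bigrams[(prev, word)] / unigrams[prev]

def p_add1(prev, word, unigrams, bigrams):
    """(count(prev, word) + 1) / (count(prev) + V)"""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

# Hypothetical toy corpus
TOKENS = "i want chinese food i want english food".split()
UNIGRAMS, BIGRAMS = train(TOKENS)
```

<p>With this corpus the unseen bigram (want, food) gets probability 0 under MLE but 1/7 under add-1, while the seen bigram (want, chinese) drops from 1/2 to 2/7.</p>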
<p>Reminder: Maximum Likelihood Estimate<br />
-of some parameter of a model M from a training set T<br />
-maximizes the likelihood of the training set T given the model M</p>
<p>Suppose the word &#8220;bagel&#8221; occurs 400 times in a corpus of a million words. What is the probability that a random word from some other text will be &#8220;bagel&#8221;?</p>
<p>MLE estimate is 400/1,000,000=0.0004</p>
<p>This may be a bad estimate for some other corpus, but it is the estimate that makes it most likely that &#8220;bagel&#8221; will occur 400 times in a million word corpus.</p>
<p>Reconstituted counts. How much has the add-one smoothing changed the probabilities?</p>
<p>Add-1 estimation is a blunt instrument. It gives massive changes to probability counts.</p>
<p>So add-1 isn&#8217;t used for N-grams. We&#8217;ll see better methods. But add-1 is used to smooth other NLP models, like text classification, or in domains where the number of zeros isn&#8217;t so huge.</p>
<p>Backoff and Interpolation</p>
<p>Sometimes it helps to use less context.</p>
<p>Backoff: use trigram if you have good evidence, otherwise bigram, otherwise unigram</p>
<p>Interpolation: mix unigram, bigram, trigram</p>
<p>Interpolation works better.</p>
<p>Linear Interpolation</p>
<p>Two kinds: Simple interpolation and lambdas conditional on context</p>
<p>How to set the lambdas?</p>
<p>Use a held-out corpus: training data, held-out data, test data</p>
<p>Choose lambdas to maximize the probability of held-out data<br />
-fix the N-gram probabilities (on the training data), then search for lambdas that give largest probability to held out set</p>
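<p>A sketch of simple interpolation with a grid search for the lambdas. It assumes the unigram, bigram and trigram probabilities of each held-out event were already estimated on the training data; the numbers and the grid step are hypothetical, and real systems typically use EM rather than a grid.</p>

```python
import itertools
import math

def interpolate(p_uni, p_bi, p_tri, lambdas):
    """Simple linear interpolation; the three lambdas sum to 1."""
    l1, l2, l3 = lambdas
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

def best_lambdas(held_out, step=0.1):
    """Return the grid point that gives the held-out events the largest log-probability."""
    best, best_ll = None, -math.inf
    steps = round(1 / step)
    for i, j in itertools.product(range(steps + 1), repeat=2):
        if i + j > steps:
            continue
        lambdas = (i * step, j * step, (steps - i - j) * step)
        probs = [interpolate(e[0], e[1], e[2], lambdas) for e in held_out]
        if min(probs) == 0:
            continue  # a zero would make the log-probability undefined
        ll = sum(math.log(p) for p in probs)
        if ll > best_ll:
            best, best_ll = lambdas, ll
    return best

# Hypothetical (unigram, bigram, trigram) probabilities for three held-out events
HELD_OUT = [(0.01, 0.2, 0.0), (0.02, 0.0, 0.5), (0.005, 0.1, 0.3)]
```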
<p>Unknown words: Open versus closed vocabulary tasks</p>
<p>If we know all the words in advance<br />
-vocabulary V is fixed<br />
-closed vocabulary task</p>
<p>Often we don&#8217;t know this<br />
-Out Of Vocabulary = OOV words<br />
-open vocabulary task</p>
<p>Instead, create an unknown word token &lt;UNK&gt;<br />
-training of &lt;UNK&gt; probabilities: create a fixed lexicon L of size V, at text normalization phase any training word not in L changed to &lt;UNK&gt;, now we train its probabilities like a normal word</p>
<p>At decoding time<br />
-if text input, use UNK probabilities for any word not in training</p>
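<p>A sketch of the normalization step. The token is spelled UNK without angle brackets here, and choosing the lexicon as the most frequent training words is an assumed selection rule; the notes only say the lexicon is fixed.</p>

```python
from collections import Counter

UNK = "UNK"  # stands in for the unknown-word token

def build_lexicon(training_tokens, size):
    """Fixed lexicon L: the `size` most frequent training words (assumed rule)."""
    return {w for w, _ in Counter(training_tokens).most_common(size)}

def normalize(tokens, lexicon):
    """Replace every word outside the lexicon with the unknown-word token."""
    return [w if w in lexicon else UNK for w in tokens]

TRAIN = "the cat sat on the mat the cat".split()
LEXICON = build_lexicon(TRAIN, 2)
```

<p>The same normalize call is applied to the training text before counting and to the input text at decoding time.</p>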
<p>Huge web-scale N-grams</p>
<p>How to deal with Google N-gram corpus</p>
<p>Pruning<br />
-only store N-grams with count&gt;threshold, remove singletons of higher-order n-grams<br />
-entropy-based pruning</p>
<p>Efficiency<br />
-efficient data structures like tries<br />
-bloom filters: approximate language models<br />
-store words as indexes, not strings, use Huffman coding to fit large numbers of words into two bytes<br />
-quantize probabilities (4-8 bits instead of 8-byte float)</p>
<p>Smoothing for web-scale N-grams<br />
-&#8220;stupid backoff&#8221;<br />
-no discounting, just use relative frequencies</p>
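<p>A bigram-only sketch of stupid backoff. The backoff weight alpha=0.4 is the value commonly quoted for this method, not something stated in these notes, and the toy corpus is hypothetical. The result is a score, not a normalized probability.</p>

```python
from collections import Counter

def stupid_backoff(prev, word, unigrams, bigrams, total, alpha=0.4):
    """Relative frequency of the bigram if it was seen, else alpha times the unigram frequency."""
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]
    return alpha * unigrams[word] / total

# Hypothetical toy corpus
TOKENS = "i want chinese food i want english food".split()
UNIGRAMS = Counter(TOKENS)
BIGRAMS = Counter(zip(TOKENS, TOKENS[1:]))
```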
<p>N-gram smoothing summary:</p>
<p>Add-1 smoothing<br />
-ok for text categorization, not for language modeling</p>
<p>The most commonly used method<br />
-extended interpolated Kneser-Ney</p>
<p>For very large N-grams like the web<br />
-stupid backoff</p>
<p>Advanced Language Modeling</p>
<p>Discriminative models<br />
-choose n-gram weights to improve a task, not to fit the training set</p>
<p>Parsing-based models</p>
<p>Caching models<br />
-recently used words are more likely to appear<br />
-these perform very poorly for speech recognition. Why?</p>
<p>Advanced: Good Turing Smoothing</p>
<p>Add-1 smoothing<br />
Add-k smoothing<br />
Unigram prior smoothing</p>
<p>Advanced smoothing algorithms</p>
<p>Intuition used by many smoothing algorithms, Good-Turing, Kneser-Ney, Witten-Bell.</p>
<p>Use the count of things we&#8217;ve seen once to help estimate the count of things we&#8217;ve never seen.</p>
<p>Notation:</p>
<p>Nc = Frequency of frequency c</p>
<p>Nc= the count of things we&#8217;ve seen c times</p>
<p>Sam I am I am Sam I do not eat</p>
<p>I &#8211; 3<br />
Sam &#8211; 2<br />
am &#8211; 2<br />
do &#8211; 1<br />
not &#8211; 1<br />
eat &#8211; 1</p>
<p>N1=3<br />
N2=2<br />
N3=1</p>
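<p>The counts of counts for this example can be reproduced with two counters; a short sketch:</p>

```python
from collections import Counter

tokens = "Sam I am I am Sam I do not eat".split()
word_counts = Counter(tokens)                    # how often each word occurs
freq_of_freq = Counter(word_counts.values())     # Nc: how many words occur c times
```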
<p>Good-Turing smoothing intuition</p>
<p>Imagine that you are fishing and caught 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish</p>
<p>How likely is it that next species is trout?<br />
1/18</p>
<p>How likely is it that next species is new (catfish or bass)?<br />
let&#8217;s use our estimate of things-we-saw-once to estimate the new things<br />
3/18 (because N1=3)</p>
<p>Assuming so, how likely is it that next species is trout?<br />
must be less than 1/18, we&#8217;ve used some of our probability mass for new fish. how to estimate?</p>
<p>Good-Turing calculations</p>
<p>Unseen (bass or catfish)<br />
c=0<br />
MLE p=0/18=0<br />
P*(unseen)=N1/N=3/18</p>
<p>Seen once (trout)<br />
c=1<br />
MLE p=1/18<br />
c*(trout)=2*N2/N1=2*1/3=2/3<br />
P*(trout)=(2/3)/18=1/27</p>
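<p>The fishing calculation, redone with exact fractions as a check. The function names are made up for this sketch.</p>

```python
from collections import Counter
from fractions import Fraction

catch = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
N = sum(catch.values())          # 18 fish in total
Nc = Counter(catch.values())     # Nc[c] = number of species caught c times

def p_unseen():
    """Probability mass reserved for new species: N1 / N."""
    return Fraction(Nc[1], N)

def adjusted_count(c):
    """c* = (c + 1) * N_{c+1} / N_c"""
    return Fraction((c + 1) * Nc[c + 1], Nc[c])

def p_good_turing(species):
    return adjusted_count(catch[species]) / N
```

<p>Note that adjusted_count(3) comes out as 0 here because no species was caught 4 times, which illustrates why gaps in the counts of counts are a problem for the raw formula.</p>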
<p>Ney et al. Good Turing Intuition</p>
<p>Held-out words<br />
pull out one word from a training set</p>
<p>Intuition from leave-one-out validation<br />
-take each of the c training words out in turn<br />
-c training sets of size c-1, held-out of size 1<br />
-what fraction of held-out words are unseen in training? N1/c<br />
-what fraction of held-out words are seen k times in training? (k+1)Nk+1/c<br />
-so in the future we expect (k+1)Nk+1/c of the words to be those with training count k<br />
-there are Nk words with training count k<br />
-each should occur with probability (k+1)Nk+1/c/Nk<br />
or expected count k*=(k+1)Nk+1/Nk</p>
<p>Good-Turing complications</p>
<p>Problem: what about &#8220;the&#8221;? (say c=4417)<br />
for small k, Nk&gt;Nk+1<br />
for large k, too jumpy, zeroes wreck estimates</p>
<p>Simple Good-Turing: replace empirical Nk with a best-fit power law once the counts of counts get unreliable</p>
<p>Resulting Good-Turing numbers</p>
<p>Numbers from Church and Gale<br />
22 million words of AP Newswire<br />
each count has been discounted to leave room for zeroes<br />
what&#8217;s the relationship between the count and the Good-Turing count?<br />
fixed small discount</p>
<p>Kneser-Ney Smoothing</p>
<p>Absolute discounting interpolation<br />
save ourselves some time and just subtract 0.75 or some d</p>
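<p>A sketch of interpolated absolute discounting for bigrams, assuming the usual form: subtract d from each seen bigram count and hand the freed mass to the unigram distribution. The toy corpus is hypothetical, and the context word is assumed to have been seen with at least one follower.</p>

```python
from collections import Counter

def absolute_discount(prev, word, tokens, d=0.75):
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigrams = Counter(tokens)
    # total count and number of distinct followers of the context word
    context_total = sum(c for (a, _), c in bigrams.items() if a == prev)
    followers = sum(1 for (a, _) in bigrams if a == prev)
    discounted = max(bigrams[(prev, word)] - d, 0) / context_total
    lam = d * followers / context_total      # mass freed by the discounting
    return discounted + lam * unigrams[word] / len(tokens)

# Hypothetical toy corpus
TOKENS = "i want chinese food i want english food".split()
```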
<p>But should we really just use the regular unigram P(w)?</p>
<p>Kneser Ney Smoothing I</p>
<p>Better estimate for probabilities of lower-order unigrams.<br />
Shannon game: I can&#8217;t see without my reading ___<br />
&#8220;Francisco&#8221; is more common than &#8220;glasses&#8221;<br />
but &#8220;Francisco&#8221; always follows &#8220;San&#8221;.</p>
<p>The unigram is useful exactly when we haven&#8217;t seen this bigram.</p>
<p>Instead of P(w): How likely is w?<br />
Pcontinuation(w): How likely is w to appear as a novel continuation?<br />
-for each word count the number of bigram types it completes<br />
-every bigram type was a novel continuation the first time it was seen</p>
<p>Kneser-Ney Smoothing II</p>
<p>How many times does w appear as a novel continuation, normalized by the total number of word bigram types.</p>
<p>Kneser-Ney Smoothing III</p>
<p>Alternative metaphor: the number of word types seen to precede w, normalized by the number of word types preceding all words.</p>
<p>A frequent word (Francisco) occurring in only one context (San) will have a low continuation probability.</p>
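<p>The continuation count can be sketched directly from a token list. The toy text is made up so that Francisco is more frequent than glasses but only ever follows San.</p>

```python
def continuation_probability(word, tokens):
    """Number of distinct words seen before `word`, divided by the number of bigram types."""
    bigram_types = set(zip(tokens, tokens[1:]))
    preceding_types = {a for a, b in bigram_types if b == word}
    return len(preceding_types) / len(bigram_types)

# Hypothetical toy text
TOKENS = ("San Francisco San Francisco San Francisco San Francisco "
          "my glasses your glasses reading glasses").split()
```

<p>Here Francisco occurs four times and glasses three times, yet the continuation probability is 1/8 for Francisco and 3/8 for glasses.</p>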
<p>Kneser-Ney Smoothing IV</p>
]]></content:encoded>
			<wfw:commentRss>http://annanorberg.net/blog/2012/03/26/notes-from-nlp-language-modeling-p2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Notes from NLP course: Language modeling, part one</title>
		<link>http://annanorberg.net/blog/2012/03/23/notes-from-nlp-language-modeling-1/</link>
		<comments>http://annanorberg.net/blog/2012/03/23/notes-from-nlp-language-modeling-1/#comments</comments>
		<pubDate>Fri, 23 Mar 2012 11:48:18 +0000</pubDate>
		<dc:creator>Anna</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[natural language processing]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[stanford]]></category>

		<guid isPermaLink="false">http://annanorberg.net/blog/?p=854</guid>
		<description><![CDATA[Language Modeling Goal: assign a probability to a sentence Why? -machine translation, P(high winds tonight)&#62;P(large winds tonight) -spell correction, &#8220;The office is about fifteen minuets from my house&#8221;, P(about fifteen minutes from)&#62;P(about fifteen minuets from) -speech recognition, P(I saw a van)&#62;P(eyes awe of an) -summarization, question-answering, etc. Probabilistic Language Modeling Goal: compute the probability of [...]]]></description>
			<content:encoded><![CDATA[<p>Language Modeling</p>
<p>Goal: assign a probability to a sentence</p>
<p>Why?<br />
-machine translation, P(high winds tonight)&gt;P(large winds tonight)<br />
-spell correction, &#8220;The office is about fifteen <strong>minuets</strong> from my house&#8221;, P(about fifteen minutes from)&gt;P(about fifteen minuets from)<br />
-speech recognition, P(I saw a van)&gt;P(eyes awe of an)<br />
-summarization, question-answering, etc.</p>
<p>Probabilistic Language Modeling</p>
<p>Goal: compute the probability of a sentence or sequence of words, P(W)=P(w1,w2,w3,w4,w5&#8230;wn)</p>
<p>Related task: probability of an upcoming word, P(w5|w1,w2,w3,w4)</p>
<p>A model that computes either of these is called a language model. &#8220;The grammar&#8221; would be better, but language model or LM is standard.</p>
<p>How to compute P(W)</p>
<p>how to compute joint probability: P(its, water, is, so, transparent, that)</p>
<p>intuition &#8211; let&#8217;s rely on the Chain Rule of Probability</p>
<p>Reminder: The Chain Rule</p>
<p>recall the definition of conditional probabilities<br />
P(A|B)=P(A,B)/P(B)</p>
<p>rewriting<br />
P(A|B)P(B)=P(A,B)<br />
P(A,B)=P(A|B)P(B)</p>
<p>more variables<br />
P(A,B,C,D)=P(A)P(B|A)P(C|A,B)P(D|A,B,C)</p>
<p>the chain rule in general<br />
P(x1,x2,x3,&#8230;,xn)=P(x1)P(x2|x1)P(x3|x1,x2)&#8230;P(xn|x1,&#8230;,xn-1)</p>
<p>The Chain Rule applied to compute joint probability of words in sentence:</p>
<p>P(&#8220;its water is so transparent&#8221;)=P(its)*P(water|its)*P(is|its water)*P(so|its water is)*P(transparent|its water is so)</p>
<p>How to estimate these probabilities</p>
<p>Could we just count and divide?</p>
<p>P(the|its water is so transparent that)=Count(its water is so transparent that the)/Count(its water is so transparent that)</p>
<p>No, too many possible sentences. We&#8217;ll never see enough data for estimating these.</p>
<p>Markov Assumption</p>
<p>simplifying assumption:<br />
P(the|its water is so transparent that) ~ P(the|that)</p>
<p>or maybe<br />
P(the|its water is so transparent that) ~ P(the|transparent that)</p>
<p>We approximate each component in the product.</p>
<p>Simplest case: Unigram model<br />
Slightly more sophisticated: Bigram model &#8211; condition on the previous word</p>
<p>N-gram models</p>
<p>We can extend to trigrams, 4-grams, 5-grams. In general this is an insufficient model of language, because language has long-distance dependencies.</p>
<p>&#8220;The computer which I had just put into the machine room on the fifth floor crashed.&#8221;</p>
<p>But in practice we can often get away with N-gram models.</p>
<p>Estimating bigram probabilities</p>
<p>The Maximum Likelihood Estimate (formula)</p>
<p>An example:</p>
<p>&lt;s&gt; I am Sam &lt;/s&gt;<br />
&lt;s&gt; Sam I am &lt;/s&gt;<br />
&lt;s&gt; I do not like green eggs and ham &lt;/s&gt;</p>
<p>P(I|&lt;s&gt;)=2/3=0.67<br />
P(&lt;/s&gt;|Sam)=1/2=0.5<br />
P(Sam|&lt;s&gt;)=1/3=0.33<br />
P(Sam|am)=1/2=0.5<br />
P(am|I)=2/3=0.67<br />
P(do|I)=1/3=0.33</p>
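<p>These estimates can be recomputed from the three sentences. In this sketch BOS and EOS stand in for the &lt;s&gt; and &lt;/s&gt; markers, and exact fractions are used so the values are easy to compare.</p>

```python
from collections import Counter
from fractions import Fraction

BOS, EOS = "BOS", "EOS"  # stand-ins for the sentence start and end markers

sentences = ["I am Sam", "Sam I am", "I do not like green eggs and ham"]
corpus = [[BOS] + s.split() + [EOS] for s in sentences]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))

def p(word, prev):
    """MLE bigram estimate: count(prev, word) / count(prev)."""
    return Fraction(bigrams[(prev, word)], unigrams[prev])

def sentence_probability(text):
    """Product of the bigram estimates over the whole sentence."""
    words = [BOS] + text.split() + [EOS]
    result = Fraction(1)
    for prev, word in zip(words, words[1:]):
        result *= p(word, prev)
    return result
```

<p>For example, sentence_probability("I am Sam") multiplies 2/3, 2/3, 1/2 and 1/2, which gives 1/9.</p>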
<p>More examples: Berkeley Restaurant Project sentences<br />
-can you tell me about any good cantonese restaurants close by<br />
-mid priced thai food is what i&#8217;m looking for<br />
-tell me about chez panisse<br />
-can you give me a listing of the kinds of food that are available<br />
-i&#8217;m looking for a good place to eat breakfast<br />
-when is caffe venezia open during the day</p>
<p>Raw bigram counts<br />
Normalize by unigrams</p>
<p>Bigram estimates of sentence probabilities<br />
P(&lt;s&gt; I want english food &lt;/s&gt;)=<br />
P(I|&lt;s&gt;)*P(want|I)*P(english|want)*P(food|english)*P(&lt;/s&gt;|food)</p>
<p>What kinds of knowledge?</p>
<p>P(english|want)=0.0011<br />
P(chinese|want)=0.0065</p>
<p>fact about the world, chinese food more popular, more people want it</p>
<p>P(to|want)=0.66</p>
<p>want&gt;infinitive, grammatical fact</p>
<p>In practice we do everything in log space<br />
-avoid underflow<br />
-also adding is faster than multiplying</p>
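<p>A small demonstration of the underflow problem. It assumes a sequence of 200 factors, each as small as P(english|want)=0.0011; the length is arbitrary.</p>

```python
import math

probs = [0.0011] * 200

product = math.prod(probs)                   # underflows to 0.0 in floating point
log_sum = sum(math.log(p) for p in probs)    # stays a finite negative number
```

<p>Sequences can still be compared through log_sum even though the raw product is no longer distinguishable from zero.</p>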
<p>Language Modeling Toolkits</p>
<p>publicly available:<br />
SRILM<br />
Google N-Gram<br />
Google Book N-grams</p>
<p>Evaluation and Perplexity</p>
<p>Evaluation: How good is our model?</p>
<p>Does our language model prefer good sentences to bad ones?<br />
-assign higher probability to &#8220;real&#8221; or &#8220;frequently observed&#8221; sentences than &#8220;ungrammatical&#8221; or &#8220;rarely observed&#8221; sentences?<br />
We train parameters of our model on a training set.<br />
We test the model&#8217;s performance on data we haven&#8217;t seen<br />
-a test set is an unseen dataset that is different from our training set, totally unused<br />
-an evaluation metric tells us how well our model does on the test set</p>
<p>Extrinsic evaluation of N-gram models</p>
<p>Best evaluation for comparing models A and B<br />
-put each model in a task (spelling corrector, speech recognizer, MT system)<br />
Run the task, get an accuracy for A and B<br />
-how many misspelled words corrected properly<br />
-how many words translated correctly<br />
Compare accuracy for A and B</p>
<p>Difficulty of extrinsic (in-vivo) evaluation of N-gram models</p>
<p>-time-consuming, can take days or weeks<br />
-so instead we use intrinsic evaluation<br />
-the most common is called perplexity<br />
-bad approximation unless the test data looks just like the training data<br />
-generally only useful in pilot experiments, but is helpful to think about</p>
<p>Intuition of Perplexity</p>
<p>The Shannon Game: How well can we predict the next word?<br />
I always order pizza with cheese and ___<br />
The 33rd President of the US was ___<br />
I saw a ___</p>
<p>Unigrams are terrible at this game. Why?</p>
<p>A better model of a text is one which assigns a higher probability to the word that actually occurs.</p>
<p>The best language model is one that best predicts an unseen test set<br />
-gives the highest P(sentence)</p>
<p>Perplexity is the inverse probability of the test set, normalized by the number of words.</p>
<p>Minimizing perplexity is the same as maximizing probability.</p>
<p>The Shannon Game intuition for perplexity<br />
-from Josh Goodman<br />
-average branching factor<br />
-how hard is the task of recognizing digits &#8216;0,1,2,3,4,5,6,7,8,9&#8217; &#8211; perplexity 10<br />
-how hard is recognizing (30,000) names at Microsoft &#8211; perplexity 30,000<br />
-if a system has to recognize Operator (1 in 4), Sales (1 in 4), Technical Support (1 in 4) or 30,000 names (1 in 120,000 each) &#8211; perplexity is weighted equivalent branching factor, perplexity is 53</p>
<p>Perplexity as branching factor</p>
<p>Suppose a sentence consists of random digits. What is the perplexity of this sentence according to a model that assigns P=1/10 to each digit?</p>
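<p>Both perplexity examples can be checked numerically. This sketch treats perplexity as the inverse probability normalized by length, and the weighted branching factor as 2 to the power of the entropy; the sentence length of 25 digits is arbitrary.</p>

```python
import math

def perplexity(word_probs):
    """Inverse probability of the sequence, normalised by its length N."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

def branching_factor(distribution):
    """Weighted equivalent branching factor: 2 ** entropy (in bits)."""
    entropy = -sum(p * math.log2(p) for p in distribution)
    return 2 ** entropy

digits = perplexity([1 / 10] * 25)     # about 10 for any sentence length
switchboard = branching_factor([1 / 4] * 3 + [1 / 120000] * 30000)
```

<p>The second value rounds to 53, matching the figure quoted for the Operator/Sales/Technical Support example.</p>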
<p>Lower perplexity=better model</p>
<p>Generalization and zeros</p>
<p>The Shannon Visualization Method<br />
-choose a random bigram (&lt;s&gt;,w) according to its probability<br />
-now choose a random bigram (w,x) according to its probability<br />
-and so on until we choose &lt;/s&gt;<br />
-then string the words together</p>
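<p>A sketch of this sampling procedure for a bigram model. BOS and EOS stand in for &lt;s&gt; and &lt;/s&gt;, the training sentences are a toy example, and max_words is an added safety limit that is not part of the method.</p>

```python
import random
from collections import defaultdict

BOS, EOS = "BOS", "EOS"

def build_successors(sentences):
    """Map each word to the list of words that followed it in training."""
    successors = defaultdict(list)
    for s in sentences:
        words = [BOS] + s.split() + [EOS]
        for prev, word in zip(words, words[1:]):
            successors[prev].append(word)   # repeats keep the bigram frequencies
    return successors

def generate(successors, rng, max_words=50):
    """Sample one bigram at a time until the end marker is chosen."""
    words, current = [], BOS
    while len(words) != max_words:
        current = rng.choice(successors[current])
        if current == EOS:
            break
        words.append(current)
    return " ".join(words)

SUCCESSORS = build_successors(["I am Sam", "Sam I am", "I do not like green eggs and ham"])
```

<p>Calling generate(SUCCESSORS, random.Random()) returns a different word string on most runs; every adjacent pair in it was seen in training.</p>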
<p>Approximating Shakespeare</p>
<p>Shakespeare as corpus<br />
N=884,647 tokens, V=29,066</p>
<p>Shakespeare produced 300,000 bigram types out of 844 million possible bigrams.</p>
<p>So 99.96% of the possible bigrams were never seen (have zero entries in the table).</p>
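<p>The arithmetic behind these two figures, as a quick check:</p>

```python
V = 29_066                       # vocabulary size
seen_bigram_types = 300_000
possible_bigrams = V ** 2        # roughly 844 million
unseen_share = 1 - seen_bigram_types / possible_bigrams
```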
<p>Quadrigrams worse: What&#8217;s coming out looks like Shakespeare because it is Shakespeare.</p>
<p>The perils of overfitting</p>
<p>N-grams only work well for word prediction if the test corpus looks like the training corpus. In real life, it often doesn&#8217;t. We need to train robust models that generalize. One kind of generalization: Zeros (things that never occur in the training set, but do occur in the test set).</p>
<p>Zeros</p>
<p>Training set:<br />
&#8230; denied the allegations<br />
&#8230; denied the reports<br />
&#8230; denied the claims<br />
&#8230; denied the request</p>
<p>P(&#8220;offer&#8221;|denied the)=0</p>
<p>Test set:<br />
&#8230; denied the offer<br />
&#8230; denied the loan</p>
<p>Probability will be zero. Big problem!</p>
<p>Zero probability bigrams</p>
<p>Bigrams with zero probability mean that we will assign 0 probability to the test set and hence we cannot compute perplexity (can&#8217;t divide by 0).</p>
]]></content:encoded>
			<wfw:commentRss>http://annanorberg.net/blog/2012/03/23/notes-from-nlp-language-modeling-1/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Notes from NLP course: Edit distance</title>
		<link>http://annanorberg.net/blog/2012/03/17/notes-from-nlp-edit-distance/</link>
		<comments>http://annanorberg.net/blog/2012/03/17/notes-from-nlp-edit-distance/#comments</comments>
		<pubDate>Sat, 17 Mar 2012 12:00:18 +0000</pubDate>
		<dc:creator>Anna</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[natural language processing]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[stanford]]></category>

		<guid isPermaLink="false">http://annanorberg.net/blog/?p=838</guid>
		<description><![CDATA[Defining minimum edit distance How similar are two strings? -spell correction -computational biology The minimum edit distance between two strings is the minimum number of editing operations (insertion, deletion, substitution) needed to transform one into the other. two strings and their alignment INTENTION EXECUTION del, sub, sub, ins, sub if each operation has cost of [...]]]></description>
			<content:encoded><![CDATA[<p>Defining minimum edit distance</p>
<p>How similar are two strings?<br />
-spell correction<br />
-computational biology</p>
<p>The minimum edit distance between two strings is the minimum number of editing operations (insertion, deletion, substitution) needed to transform one into the other.</p>
<p>two strings and their alignment</p>
<p>INTENTION<br />
EXECUTION</p>
<p>del, sub, sub, ins, sub</p>
<p>if each operation has cost of 1, distance in this case is 5<br />
if substitutions cost 2 (Levenshtein), distance is 8</p>
<p>Alignment in computational biology<br />
-given two sequences, align each letter to a letter or gap</p>
<p>Other uses of edit distance in NLP<br />
-evaluating machine translation and speech recognition<br />
-named entity extraction and entity coreference</p>
<p>How to find the minimum edit distance?</p>
<p>Searching for a path (sequence of edits) from the start string to the final string<br />
-initial state: the word we&#8217;re transforming<br />
-operators: insert, delete, substitute<br />
-goal state: the word we&#8217;re trying to get to<br />
-path cost: what we want to minimize, the number of edits</p>
<p>Minimum edit as search</p>
<p>But the space of all edit sequences is huge<br />
-can&#8217;t afford to navigate naively<br />
-lots of distinct paths wind up at the same state, don&#8217;t have to keep track of all of them, just the shortest path to each of those revisited states</p>
<p>Definition of minimum edit distance</p>
<p>For two strings<br />
X of length n<br />
Y of length m<br />
D(i,j) is the edit distance between X[1..i] and Y[1..j]<br />
the edit distance between X and Y is thus D(n,m)</p>
<p>Computing minimum edit distance</p>
<p>Dynamic programming: A tabular computation of D(n,m)<br />
Solving problems by combining solutions to subproblems<br />
Bottom-up</p>
<p>Equation defining minimum edit distance (Levenshtein)</p>
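<p>A sketch of the tabular computation, assuming insertions and deletions cost 1 and substitutions cost 2 by default (the Levenshtein variant used in these notes). Strings are indexed from 0 in the code, so X[1..i] corresponds to x[:i].</p>

```python
def min_edit_distance(x, y, sub_cost=2):
    n, m = len(x), len(y)
    # D[i][j] = distance between the first i characters of x and the first j of y
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i            # i deletions
    for j in range(1, m + 1):
        D[0][j] = j            # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = 0 if x[i - 1] == y[j - 1] else sub_cost
            D[i][j] = min(D[i - 1][j] + 1,                 # deletion
                          D[i][j - 1] + 1,                 # insertion
                          D[i - 1][j - 1] + substitution)  # substitution or match
    return D[n][m]
```

<p>For "intention" and "execution" this gives 8 with the default costs and 5 when sub_cost=1.</p>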
<p>The edit distance table</p>
<p>Backtrace for computing alignments</p>
<p>Computing alignments<br />
-Edit distance isn&#8217;t sufficient<br />
-We keep a “backtrace”<br />
-Every time we enter a cell, remember where we came from. When we reach the end, trace back the path from the upper right corner to read off the alignment.</p>
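<p>A sketch of the same table with a backpointer stored in every cell. Ties between equally cheap moves are broken arbitrarily here, so the alignment returned is one optimal alignment, not necessarily the one shown in the lecture.</p>

```python
def align(x, y, sub_cost=2):
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0], back[i][0] = i, "del"
    for j in range(1, m + 1):
        D[0][j], back[0][j] = j, "ins"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = x[i - 1] == y[j - 1]
            options = [
                (D[i - 1][j - 1] + (0 if same else sub_cost), "match" if same else "sub"),
                (D[i - 1][j] + 1, "del"),
                (D[i][j - 1] + 1, "ins"),
            ]
            D[i][j], back[i][j] = min(options)   # remember where we came from
    # Trace back from the final cell to (0, 0)
    ops, i, j = [], n, m
    while back[i][j] is not None:
        op = back[i][j]
        ops.append(op)
        if op != "ins":
            i -= 1
        if op != "del":
            j -= 1
    return D[n][m], ops[::-1]
```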
<p>Adding backtrace to minimum edit distance</p>
<p>The distance matrix<br />
-Every non-decreasing path from (0,0) to (M,N) corresponds to an alignment of the two sequences.<br />
-An optimal alignment is composed of optimal subalignments.</p>
<p>Result of backtrace<br />
two strings and their alignment</p>
<p>Performance<br />
Time: O(nm)<br />
Space: O(nm)<br />
Backtrace: O(n+m)</p>
<p>Weighted edit distance</p>
<p>Why add weight?<br />
-Spell correction<br />
-Biology</p>
<p>Minimum edit distance in computational biology</p>
<p>Why sequence alignment?</p>
<p>Comparing genes or regions from different species<br />
Assembling fragments to sequence DNA<br />
Comparing individuals to look for mutations</p>
<p>Alignments in two fields<br />
-In NLP we talk about distance (minimized) and weights<br />
-In Computational Biology we talk about similarity (maximized) and scores</p>
<p>The Needleman-Wunsch algorithm</p>
<p>The Needleman-Wunsch matrix</p>
<p>A variant of the basic algorithm<br />
-unlimited number of gaps in beginning and end, we don&#8217;t want to penalize this<br />
-different types of overlaps</p>
<p>The overlap detection variant</p>
<p>The local alignment problem<br />
Given two strings, find substrings whose similarity is maximum</p>
<p>The Smith-Waterman algorithm<br />
Idea: Ignore badly aligning regions</p>
<p>Modifications to Needleman-Wunsch</p>
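<p>A sketch of those modifications: the first row and column stay at zero, every cell is clamped at zero so a bad region can be dropped, and the answer is the best cell anywhere in the table. The scoring values (match +1, mismatch -1, gap -1) are assumptions for illustration.</p>

```python
def local_alignment_score(x, y, match=1, mismatch=-1, gap=-1):
    best = 0
    prev_row = [0] * (len(y) + 1)
    for i in range(1, len(x) + 1):
        row = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            diagonal = prev_row[j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            # max with 0: a badly aligning prefix is ignored instead of penalised
            row[j] = max(0, diagonal, prev_row[j] + gap, row[j - 1] + gap)
            best = max(best, row[j])
        prev_row = row
    return best
```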
]]></content:encoded>
			<wfw:commentRss>http://annanorberg.net/blog/2012/03/17/notes-from-nlp-edit-distance/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Notes from NLP course: Basic text editing</title>
		<link>http://annanorberg.net/blog/2012/03/16/notes-from-nlp-basic-text-editing/</link>
		<comments>http://annanorberg.net/blog/2012/03/16/notes-from-nlp-basic-text-editing/#comments</comments>
		<pubDate>Fri, 16 Mar 2012 15:07:18 +0000</pubDate>
		<dc:creator>Anna</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[natural language processing]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[stanford]]></category>

		<guid isPermaLink="false">http://annanorberg.net/blog/?p=830</guid>
		<description><![CDATA[Basic text editing Regular expressions formal language for specifying text strings Disjunctions any letters inside square brackets [ ] ranges [A-Z] negations [^Ss] caret ^ means &#8220;do not match&#8221; pipe &#124; search for either or [groundhog&#124;woodchuck] special characters: ? previous character optional * zero or more of previous characters + one or more of previous [...]]]></description>
			<content:encoded><![CDATA[<p>Basic text editing</p>
<p>Regular expressions<br />
formal language for specifying text strings</p>
<p>Disjunctions</p>
<p>any letters inside square brackets [ ]<br />
ranges [A-Z]<br />
negations [^Ss] caret ^ means &#8220;do not match&#8221;<br />
pipe | search for either or: groundhog|woodchuck (without square brackets)</p>
<p>special characters:<br />
? previous character optional<br />
* zero or more of the previous character<br />
+ one or more of the previous character<br />
. any character</p>
<p>anchors:<br />
^ matches beginning of the line<br />
$ matches end of the line<br />
\ escape: to search for a period, which is a special character, you have to escape it: \.</p>
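<p>The operators above, tried out with the Python re module. The sample sentence is made up for the sketch.</p>

```python
import re

text = "The woodchuck met a Groundhog. Is the colour color?"

brackets = re.findall(r"[Ww]oodchuck", text)               # either letter in the brackets
either = re.findall(r"[Gg]roundhog|[Ww]oodchuck", text)    # pipe: either pattern
optional = re.findall(r"colou?r", text)                    # ? makes the u optional
periods = re.findall(r"\.", text)                          # escaped period is a literal "."
start = re.findall(r"^[A-Z]", text)                        # ^ outside brackets anchors the start
negated = re.findall(r"[^a-z ]", text)                     # ^ inside brackets negates the set
```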
<p>Errors</p>
<p>False positives (Type I) matching strings that we should not have matched<br />
False negatives (Type II) not matching things that we should have matched</p>
<p>constantly dealing with these errors</p>
<p>reducing error rate means:<br />
-increasing accuracy or precision<br />
-increasing coverage or recall</p>
<p>Word tokenization</p>
<p>Text normalization</p>
<p>Every NLP task needs to do text normalization:<br />
-segmenting/tokenizing words in running text<br />
-normalizing word formats<br />
-segmenting sentences in running text</p>
<p>How many words are in a sentence?<br />
-fragments, filled pauses<br />
-lemma: same stem, part of speech, rough word sense,<br />
wordform: the full inflected surface form</p>
<p>Type: an element of the vocabulary<br />
Token: an instance of that type in running text</p>
<p>N = number of tokens<br />
V = vocabulary = set of types<br />
|V| is the size of the vocabulary</p>
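<p>Counting tokens and types for one sentence, as a sketch. The sentence is an assumed example, and splitting on whitespace is the simplest possible tokenizer.</p>

```python
text = "they lay back on the San Francisco grass and looked at the stars and their"

tokens = text.split()
N = len(tokens)      # number of tokens
V = set(tokens)      # vocabulary: the set of types
```

<p>This gives N = 15 and |V| = 13, because "the" and "and" each occur twice. Treating San Francisco as one token would change both numbers.</p>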
<p>Issues in tokenization</p>
<p>apostrophe<br />
abbreviated words<br />
double names<br />
hyphenated words<br />
lowercase<br />
New York &#8211; one or two tokens?</p>
<p>even more problematic in other languages<br />
in German noun compounds are not segmented, German information retrieval needs compound splitter<br />
Chinese/Japanese: no spaces between words<br />
further complicated in Japanese, multiple alphabets intermingled</p>
<p>Word tokenization in Chinese, also called word segmentation</p>
<p>Chinese words are composed of characters<br />
characters are generally 1 syllable and 1 morpheme<br />
average word is 2.4 characters long</p>
<p>standard baseline segmentation algorithm: maximum matching (also called greedy); it doesn&#8217;t generally work in English, but works very well in Chinese</p>
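<p>A sketch of maximum matching: at each position take the longest dictionary word that starts there, falling back to a single character. The dictionary and the English test string are toy assumptions chosen to show how the greedy choice goes wrong without spaces.</p>

```python
def max_match(text, dictionary):
    words, i = [], 0
    while i != len(text):
        # try the longest remaining piece first, then shorter ones
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in dictionary or length == 1:
                words.append(piece)
                i += length
                break
    return words

# Hypothetical toy dictionary
DICTIONARY = {"the", "theta", "table", "bled", "down", "own", "there"}
```

<p>On "thetabledownthere" the greedy choice yields theta, bled, own, there instead of the, table, down, there.</p>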
<p>Word normalization and stemming</p>
<p>Normalization</p>
<p>need to “normalize” terms<br />
implicitly define equivalence classes of terms<br />
alternative: asymmetric expansion<br />
potentially more powerful, but less efficient</p>
<p>Case folding</p>
<p>reduce all letters to lower case<br />
since users tend to use lower case<br />
possible exception: upper case in mid-sentence<br />
for sentiment analysis, MT, information extraction case is helpful</p>
<p>Lemmatization</p>
<p>reduce inflections or variant forms to base form<br />
finding the correct dictionary headword form<br />
machine translation</p>
<p>Morphology</p>
<p>morphemes: the small meaningful units that make up words<br />
stems: the core meaning-bearing units<br />
affixes: bits and pieces that adhere to stems, often grammatical functions</p>
<p>Stemming</p>
<p>reduce terms to their stems in information retrieval<br />
stemming is crude chopping of affixes<br />
language dependent<br />
e.g. automate(s), automatic, automation all reduced to automat</p>
<p>Porter&#8217;s algorithm &#8211; most common English stemmer</p>
<p>set of rules, for example:<br />
sses &gt; ss<br />
ies &gt; i<br />
ss &gt; ss<br />
s &gt; 0</p>
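<p>These four rules can be applied in order, first match wins. This sketch covers only the rules listed here, not the whole Porter stemmer.</p>

```python
STEP_1A = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def step_1a(word):
    for suffix, replacement in STEP_1A:
        if word.endswith(suffix):
            return word[:len(word) - len(suffix)] + replacement
    return word
```

<p>The ss rule changes nothing, but it has to come before the s rule so that words like caress keep their ending.</p>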
<p>Viewing morphology in a corpus<br />
Why only strip -ing if there is a vowel?<br />
rule: (*v*)ing &gt; 0</p>
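<p>A sketch of the vowel condition. It treats only a, e, i, o, u as vowels, which is a simplification of how the real stemmer defines them.</p>

```python
VOWELS = set("aeiou")

def strip_ing(word):
    """Remove -ing only when the remaining stem contains a vowel."""
    stem = word[:-3]
    if word.endswith("ing") and VOWELS.intersection(stem):
        return stem
    return word
```

<p>So walking becomes walk, while sing and king are left alone because their stems s and k contain no vowel.</p>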
<p>Dealing with complex morphology is sometimes necessary<br />
Some languages require complex morpheme segmentation, e.g. Turkish</p>
<p>Sentence segmentation and decision trees</p>
<p>Sentence segmentation</p>
<p>!, ? are relatively unambiguous<br />
period “.” is quite ambiguous (sentence boundary, abbreviations, numbers)</p>
<p>to solve the period problem, build a binary classifier:<br />
looks at a “.”<br />
decides EndOfSentence/NotEndOfSentence<br />
to build them use hand-written rules, regular expressions or machine-learning</p>
<p>Determining if a word is end-of-sentence: Decision tree</p>
<p>more sophisticated decision tree features:<br />
-case of word with “.”<br />
-case of word after “.”<br />
-numeric features: length of word, probability</p>
<p>Implementing decision trees</p>
<p>a decision tree is just an if-then-else statement<br />
the interesting research is choosing the features<br />
too hard to build by hand</p>
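<p>To make the if-then-else point concrete, here is a tiny hand-written tree. The abbreviation list and the choice and order of the questions are hypothetical; a learned tree would pick features and thresholds from data.</p>

```python
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "inc.", "etc.", "u.s."}  # hypothetical list

def is_end_of_sentence(word, next_word):
    """word ends with a period; next_word is the following token ("" at end of text)."""
    if next_word == "":
        return True                      # nothing follows
    if word.lower() in ABBREVIATIONS:
        return False                     # known abbreviation
    if next_word[0].isupper():
        return True                      # next word is capitalised
    return False                         # lower case or a number follows
```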
<p>Decision trees and other classifiers</p>
<p>questions in decision tree could be used in any kind of classifier</p>
]]></content:encoded>
			<wfw:commentRss>http://annanorberg.net/blog/2012/03/16/notes-from-nlp-basic-text-editing/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Notes from NLP course: Introduction</title>
		<link>http://annanorberg.net/blog/2012/03/14/notes-from-nlp-introduction/</link>
		<comments>http://annanorberg.net/blog/2012/03/14/notes-from-nlp-introduction/#comments</comments>
		<pubDate>Wed, 14 Mar 2012 12:06:08 +0000</pubDate>
		<dc:creator>Anna</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[natural language processing]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[stanford]]></category>

		<guid isPermaLink="false">http://annanorberg.net/blog/?p=802</guid>
		<description><![CDATA[The course in Natural Language Processing from Stanford University has started. Here are notes from the first introduction video. Natural Language Processing Applications: Information Extraction, Sentiment Analysis, Machine Translation Language Technology mostly solved: Spam detection Part-of-speech tagging (POS) Named entity recognition (NER) making good progress: Sentiment analysis Coreference resolution Word sense disambiguation (WSD) Parsing Machine [...]]]></description>
			<content:encoded><![CDATA[<p>The course in <a href="https://www.coursera.org/nlp/auth/welcome">Natural Language Processing from Stanford University</a> has started. Here are notes from the first introduction video.</p>
<p>Natural Language Processing</p>
<p>Applications: Information Extraction, Sentiment Analysis, Machine Translation</p>
<p>Language Technology</p>
<p>mostly solved:<br />
Spam detection<br />
Part-of-speech tagging (POS)<br />
Named entity recognition (NER)</p>
<p>making good progress:<br />
Sentiment analysis<br />
Coreference resolution<br />
Word sense disambiguation (WSD)<br />
Parsing<br />
Machine translation (MT)<br />
Information extraction (IE)</p>
<p>still really hard:<br />
Question answering (QA)<br />
Paraphrase<br />
Summarization<br />
Dialog</p>
<p>Ambiguity makes NLP hard</p>
<p>headline: Violinist Linked to JAL Crash Blossoms<br />
a reader might think that the violinist is linked to “JAL Crash Blossoms”, whatever that means<br />
the actual meaning was: Violinist (who was linked to JAL crash) blossoms<br />
another example: Red Tape Holds Up New Bridges (“holds up” can mean 1. delays or 2. supports)</p>
<p>Ambiguity is pervasive</p>
<p>New York Times headline: Fed raises interest rates<br />
a parser would also consider &#8220;interest&#8221; as the verb, not only &#8220;raises&#8221;</p>
<p>even more difficult: Fed raises interest rates 0.5%<br />
&#8220;rates&#8221; could be interpreted as a verb</p>
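<p>a rough way to see how fast the readings multiply: list the tags each word could take and enumerate the combinations. The mini-lexicon below is a hypothetical simplification, not the output of a real tagger:</p>

```python
from itertools import product

# Hypothetical mini-lexicon: candidate tags per word, for illustration only.
LEXICON = {
    "Fed": ["NOUN"],
    "raises": ["NOUN", "VERB"],
    "interest": ["NOUN", "VERB"],
    "rates": ["NOUN", "VERB"],
}


def tag_sequences(words, lexicon=LEXICON):
    """Return every combination of candidate tags for the given words."""
    return list(product(*(lexicon[w] for w in words)))


SEQUENCES = tag_sequences("Fed raises interest rates".split())
```

<p>with this lexicon the four-word headline already has 8 candidate tag sequences, and a parser has to pick among them.</p>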
<p>Why else is natural language understanding difficult?</p>
<p>non-standard English<br />
segmentation issues<br />
idioms<br />
neologisms<br />
world knowledge<br />
tricky entity names</p>
<p>Making progress on this problem is difficult</p>
<p>tools: knowledge about language, knowledge about the world, a way to combine knowledge sources<br />
how we do this: probabilistic models built from language data<br />
rough text features can do some of the job</p>
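<p>a very small sketch of what “probabilistic models built from language data” can mean: count how often each word carries each tag and turn the counts into probabilities. The tagged word list is invented toy data:</p>

```python
from collections import Counter, defaultdict

# Invented toy data: (word, tag) pairs as they might appear in a tagged corpus.
TAGGED = [
    ("interest", "NOUN"), ("interest", "NOUN"), ("interest", "VERB"),
    ("rates", "NOUN"), ("rates", "NOUN"),
    ("raises", "VERB"), ("raises", "VERB"), ("raises", "NOUN"),
]


def tag_probabilities(tagged_words):
    """Estimate P(tag | word) by relative frequency."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word][tag] += 1
    return {
        word: {tag: n / sum(c.values()) for tag, n in c.items()}
        for word, c in counts.items()
    }


PROBS = tag_probabilities(TAGGED)
```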
]]></content:encoded>
			<wfw:commentRss>http://annanorberg.net/blog/2012/03/14/notes-from-nlp-introduction/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Research with social networks</title>
		<link>http://annanorberg.net/blog/2012/02/14/research-sociala-natverk/</link>
		<comments>http://annanorberg.net/blog/2012/02/14/research-sociala-natverk/#comments</comments>
		<pubDate>Tue, 14 Feb 2012 10:30:20 +0000</pubDate>
		<dc:creator>Anna</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[facebook]]></category>
		<category><![CDATA[google]]></category>
		<category><![CDATA[linkedin]]></category>
		<category><![CDATA[twitter]]></category>

		<guid isPermaLink="false">http://annanorberg.net/blog/?p=691</guid>
		<description><![CDATA[I have been catching up on how social networks can be used for research and thought I would share the best tips I found. Linkedin Some tips are available in Linkedin's own guide for journalists. Poynter has listed ten ways for reporters to use Linkedin to find sources and track changes [...]]]></description>
			<content:encoded><![CDATA[<p>I have been catching up on how social networks can be used for research and thought I would share the best tips I found.</p>
<h2>Linkedin</h2>
<p>Some tips are available in <a href="http://press.linkedin.com/understanding-linkedin">Linkedin's own guide for journalists</a>. Poynter has listed <a href="http://www.poynter.org/how-tos/newsgathering-storytelling/137926/10-ways-reporters-can-use-linkedin-to-find-sources-track-changes-at-companies/">ten ways for reporters to use Linkedin</a> to find sources and track changes at companies.</p>
<h2>Twitter</h2>
<p>Twitter also has its own <a href="https://dev.twitter.com/media/newsrooms/report">guide for journalists</a>, and the Knight Digital Media Center has put together a comprehensive <a href="http://multimedia.journalism.berkeley.edu/tutorials/twitter/">guide for beginners</a>. More advanced tips can be found in the guides from <a href="http://www.poynter.org/how-tos/digital-strategies/146345/10-ways-journalists-can-use-twitter-before-during-and-after-reporting-a-story/">Poynter</a>, <a href="http://www.journalism.co.uk/insite/tag/advanced-twitter-research/">Insite</a> and <a href="http://stevebuttry.wordpress.com/2011/10/06/advanced-twitter-techniques-for-journalists/">Steve Buttry</a>.</p>
<p>To search for tweets you can use Twitter's <a href="https://twitter.com/#!/search-home">regular search</a> or the <a href="https://twitter.com/#!/search-advanced">advanced search</a>. Twitter also has <a href="https://dev.twitter.com/docs/using-search">search operators</a> that can improve a search. To find tweets that are more than a week old you can use <a href="http://topsy.com/">Topsy</a>. When you are logged in to Twitter you can save the searches you make.</p>
<h2>Google+</h2>
<p>Mashable writes about <a href="http://mashable.com/2011/07/17/journalists-using-google-plus/">five ways journalists are using Google+</a>. They also have a long <a href="http://mashable.com/2011/07/16/google-plus-guide/">guide</a> to how it works.</p>
<h2>Facebook</h2>
<p>The Knight Digital Media Center also has a thorough <a href="http://multimedia.journalism.berkeley.edu/tutorials/facebook-journalists/">guide</a> for Facebook, and this <a href="http://www.facebook.com/journalists">Facebook page</a> has even more tips for journalists. <a href="http://www.poynter.org/latest-news/media-lab/social-media/145991/5-things-journalists-need-to-know-about-new-facebook-subscription-feature/">Poynter writes</a> about what journalists need to know about the new Subscribe feature.</p>
<p>Whatever you do, don't miss reading about how Josephine Freje Simonsson and Linnea Hambe use, for example, <a href="http://journalisten.se/nyheter/facebook-och-twitter-utmarkta-gravverktyg">Facebook and Twitter</a> to do <a href="http://www.second-opinion.se/so/view/1819">research &#8220;undercover&#8221;</a>. See also their <a href="http://bambuser.com/v/1528529">lecture</a> from Gräv 11.</p>
]]></content:encoded>
			<wfw:commentRss>http://annanorberg.net/blog/2012/02/14/research-sociala-natverk/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

