What is sentiment analysis?

A kind of text classification.

Is a movie review good or bad?
Product search
Twitter sentiment vs Gallup poll of consumer confidence
Twitter mood predicts the stock market
Target sentiment on Twitter – Twitter sentiment app

Sentiment analysis has many other names
-opinion extraction
-opinion mining
-sentiment mining
-subjectivity analysis

Why sentiment analysis?
-movie: is this review positive or negative?
-products: what de people think about the new iPhone?
-public sentiment: how is consumer confidence? is despair increasing?
-politics: what do people think about this candidate or issue?
-prediction: predict election outcomes or market trends from sentiment

Scherer typology of affective states

Emotion: brief organically synchronized … evaluation of major event
-angry, sad, joyful, fearful

Mood: diffuse non-caused low-intensity long-duration change in subjective feeling
-cheerful, gloomy, irritable

Interpersonal stances: affective stance toward another person in a specific interaction
-friendly, distant, supportive, cold

Attitudes: enduring, affectively colored beliefs, dispositions towards objects or persons
-liking, loving, hating, valuing

Personality traits: stable personality dispositions and typical behaviour tendencies
-nervous, anxious, reckless, morose

Sentiment analysis is the detection of attitudes
1. Holder (source) of attitude
2. Target (aspect) of attitude
3. Type of attitude
-from a set of types, like, love, hate, value, desire, etc.
-or (more commonly) simple weighted polarity, positive, negative, neutral together with strength
4. Text containing the attitude
-sentence or entire document

Simplest task: Is the attitude of this text positive or negative?
More complex: Rank the attitude of this text from 1 to 5
Advanced: Detect the target, source or complex attitude types

A baseline algorithm

Sentiment classification in movie reviews
Polarity detection: Is and IMDB movie review positive or negative?
Data: Polarity Data 2.0 (http://www.cs.cornell.edu/people/pabo/movie-review-data)

Baseline algorithm (adapted from Pang and Lee)

Tokenization
Feature extraction
Classification using different classifiers (Naive Bayes, MaxEnt, SVM)

Sentiment tokenization issues

Deal with HTML and XML markup
Twitter mark-ups (names, hash tags)
Capitalization (preserve for words in all caps)
Phone numbers, dates
Emoticons

Usefule code:
Christopher Potts sentiment tokenizer
Brendan O’Connor twitter tokenizer

Extracting features for sentiment classification

How to handle negation
Which words to use?

Negation

Add NOT_ to every word between negation and following punctuation:
didn’t like this movie, but I
didn’t NOT_like NOT_this NOT_movie, but I

Binarized (Boolean feature) multinomial Naive Bayes

Intuition:

For sentiment (and probably for other text classification domains) word occurrence may matter more than word frequency.
-the occurrence of the word fantastic tells us a lot
-the fact that it occurs 5 times may not tell us much more
Boolean multinomial Naive Bayes clips all the word counts in each document at 1.

Binary seems to work better than full word counts
Other possibility: log(freq(w))

Cross-Validation

Break up data into 10 folds (equal positive and negative inside each fold?)
For each fold, choose the fold as a temporary test set. Train on 9 folds, compute performance on the test fold.
Report average performance of the 10 runs.

Other issues in classification

MaxEnt and SVM then to do better than Naive Bayes

Problems: What makes reviews hard to classify?

Subtlety

Review in Perfumes: the Guide: “If you are reading this because it is your darling fragrance, please wear it at home exclusively, and tape the windows shut.”

Dorothy Parker on Katherine Hepburn: “She runs the gamut of emotions from A to B.”

Thwarted expectations and ordering effects

“This film should be brilliant. It sounds like a great plot, the actors are first grade … However, it can’t hold up.”

“Well as usual Keanu Reeves is nothing special, but surprisingly, the very talented Laurence Fishbourne is not so good either, I was surprised.”

Sentiment lexicons

The General Inquirer (free for research use)
Home page: http://www.wjh.harvard.edu/~inquirer
List of categories: http://www.wjh.harvard.edu/~inquirer/homecat.htm
Spreadsheet: http://www.wjh.harvard.edu/~inquirer/inquirerbasic.xls

LIWC (Lingustic Inquiry and Word Count)

http://www.liwc.net

MPQA Subjectivity Cues Lexicon
Home page: http://www.cs.pitt.edy/mpqa/subj_lexicon.html

Bing Liu Opinion Lexicon

http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar

SentiWordNet
Home page: http://sentiwordnet.isti.cnr.it

Analyzing the polarity of each word in IMDB (Christopher Potts)

How likely is each word to appear in each sentiment class?
Count (“bad”) in 1-star, 2-star, 3-star, etc.
But can’t use raw counts, instead likelihood.
Make them comparable between words, scaled likelihood.

Other sentiment feature: Logical negation

Is logical negation (no, not, never) associated with negative sentiment?

Potts’ experiment: Count negation in online reviews, regress against the review rating.

Potts 2011 results: More negation in negative sentiment.

Learning sentiment lexicons

Semi-supervised learning of lexicons

Use a small amount of information
-a few labeled examples
-a few hand-built patterns
To bootstrap a lexicon

Hatzivassiloglou and McKeown intuition for identifying word polarity

Adjectives conjoined by “and” have same polarity
-fair and legitimate, corrupt and brutal

Adjectives conjoined by “but” do not
-fair but brutal

Hatzivassiloglou and McKeown 1997

Step 1
Label seed set of 1336 adjectives
657 positive, 679 negative

Step 2
Expand seed set to conjoined adjectives
Google “was nice and”
what do we see:
nice, helpful
nice, classy

Step 3
Supervised classifier assigns “polarity similarity” to each word pair, resulting in a graph

Step 4
Clustering for partitioning the graph into two

Output polarity lexicon
positive words/negative words

Turney algorithm
1. Extract a phrasal lexicon from reviews.
2. Learn polarity of each phrase.
2. Rate a review by the average polarity of its phrases.

Extract two-word phrases with adjectives

How to measure polarity of a phrase?
Positive phrases co-occur more with “excellent”.
Negative phrases co-occur more with “poor”.

But how to measure co-occurrence?

Pointwoise mutual information

Mutual information between 2 random variables X and Y

Pointwise mutual information: How much more do events x and y co-occur than if they were independent?
PMI between two words: How much more do two words co-occur than if they were independent?

How to estimate pointwise mutual information

Query search engine (Altavista)

Does phrase appear more with “poor” or “excellent”?

Polarity(phrase)=PMI(phrase, “excellent”)-PMI(phrase, “poor”)

Results of Turney algorithm
-majority class baseline: 59%, Turney algorithm: 74%
-phrases rather than words
-learns domain-specific information

Using WordNet to learn polarity
-create positive and negative seed-words
-find synonyms and antonyms
-repeat, following chains of synonyms
-filter

Summary

Advantages
-can be domain-specific
-can be more robust (more words)

Intuition
-start with a seed set of words
-find other words that have similar polarity

Other sentiment tasks

Finding sentiment of a sentence

Important for finding aspects or attributes
-target of sentiment

Ex. “The food was great but the service was awful.”
positive sentiment about food, negative about service

Finding aspect/attribute/target of sentiment

Frequent phrases + rules
-find all highly frequent phrases across reviews (“fish tacos”)
-filter by rules like “occurs right after sentiment word”
-”…great fish tacos” means fish tacos a likely aspect

casino: casino, buffet, pool, resort, beds
children’s barber: haircut, job, experience, kids
department store: selection, department, sales, shop, clothing

The aspect name may not be in the sentence. For restaurants/hotels, aspects are well-understood.

Supervised classification

Hand-label a small corpus of restaurant review sentences with aspect
-food, decor, service, value, NONE

Train a classifier to assign an aspect to a sentence
-”Given this sentence, is the aspect food, decor, service, value or NONE.”

Putting it all together: Finding sentiment for aspects

Reviews > Text Extractor > Sentences & Phrases > Sentiment Classifier > Sentences & Phrases > Aspect Extractor > Sentences & Phrases > Aggregator > Final Summary

Baseline methods assume classes have equal frequencies

If not balanced (common in the real world) we can’t use accuracies as an evaluation, need to use F-scores.

Severe imbalancing also can degrade classifier performance.

Two common solutions:
1. Resampling in training, random undersampling.
2. Cost-sensitive learning, penalize SVM more for misclassification of the rare thing.

How to deal with 7 stars?
1. Map to binary.
2. Use linear or ordinal regression or specialized models like metric labeling.

Summary on sentiment

Generally modeled as classification or regression task
-predict a binary or ordinal label

Features:
-negation is important
-using all words (in Naive Bayes) works well for some tasks
-finding subset of words may help in other tasks (hand-built polarity lexicons, use seeds and semi-supervised learning to induce lexicons)

Computational work on other affective states

Emotion:
Detecting annoyed callers to dialogue system
Detecting confused/frustrated versus confident students

Mood:
Finding tramatized or depressed writers

Interpersonal stances:
Detection of flirtation or friendliness in conversations

Personality traits:
Detection of extroverts

Detection of friendliness

Friendly speakers use collaborative conversational style
-laughter
-less use of negative emotional words
-more sympathy
-more agreement
-less hedges

Evaluation of text classification

Precision, Recall and the F measure

The 2-by-2 contingency table

tp, true positive
fp, false positive
fn, false negative
tn, true negative

accuracy = (tp+tn)/(tp+fp+fn+tn)

How often does web pages mention shoe brands?

out of a sample of 100,000 web pages it is mentioned 10 times, in 99,990 times it is not mentioned

Precision and recall

Precision: % of selected items that are correct
Recall: % of correct items that are selected

Prec=tp/(tp+fp)
Rec=tp/(tp+fn)=0/10=0

A system with zero recalls is not interesting.

If you increase recall, precision drops.

A combined measure: F

A combined measure that assessees the P/R tradeoff is F measure (weighted harmonic mean).

The harmonic mean is a very conservative average.

People usually use balanced F1 measure: F=2PR/(P+R)

Evaluation: Classic Reuters-21578 Data Set
-most used data set, 21,578 docs (each 90 types, 200 tokens)
-9603 training, 3299 test articles
-118 categories (an article can be in more than one category, learn 118 binary category distinctions)
-average document (with at least one category) has 1.24 classes
-only about 10 out of 118 categories are large

Confusion matrix c

For each pair of classes <c1,c2> how many documents from c1 were incorrectly assigned to c2?

Per class evaluation measures

Recall: fraction of docs in class i classified correctly

Precision: fraction of docs assigned class i that are actually about class i

Accuracy: (1 – error rate) fraction of docs classified correctly

Micro- vs. Macro-Averaging

If we have more than one class, how de we combine multiple performance measures into one quantity?

Macro-averaging: Compute performance for each class, then average.

Micro-averaging: Collect decisions for all classes, compute contingency table, evaluate.

Micro-averaged score is dominated by score on common classes.

Development Test Sets and Cross-Validation

Training set, development test set, unseen test set

Unseen test set
-avoid overfitting
-more conservative estimate of performance

Cross-validation over multiple splits
-handle sampling errors from different datasets

Practical Issues

The real world

No training data?
Manually written rules
-need careful crafting, time-consuming

Very little data?
Use Naive Bayes
-high-bias algorithm
Get more labeled data
Try semi-supervised training methods

A reasonable amount of data?
Perfect for all the clever classifiers
You can even use user-interpretable decision trees

A huge amount of data?
Can achieve high accuracy
At a cost, can be too slow
So Naive Bayes can come back again

Real world systems generally combine:
Automatic classification
Manual review of uncertain/difficult/”new” cases

Underflow prevention: log space

Multiplying lots of probabilities can result in floating-point underflow.
Class with highest unnormalized log probability score is still most probable.
Model is now just max of sum of weights.

How to tweak performance

Domain-specific features and weights: very important in real performance.
Sometimes need to collapse terms.
Upweighting: counting a word as if it occurred twice.

Text classification and Naive Bayes

Recognising spam
Authorship authorization
Gender indentification
Sentiment analysis – positive or negative movie review?
Scientific articles – assigning subject categories, topics or genres
Language identification

Text classification: definition

Input
a document d
a fixed set of classes C = {c1, c2, …, cj}

Output
a predicted class c from C

Classification methods

Hand-coded rules

Rules based on combinations of words or other features
-spam: black-list-address OR (“dollars” AND “have been selected”)

Accuracy an be high, if rules carefully refined by expert. But building and maintaining these rules is expensive.

Supervised machine learning

Input
a document d
a fixed set of classes C = {c1, c2, …, cj}
a training set of m hand-labeled documents (d1,c1),…,(dm,cm)

Output
a learned classifier y:d -> c

Any kind of classifier:
Naive Bayes
Logistic regression
Support-vector machines
k-Nearest Neighbours

No matter which classifier we use, the task of text classification is to take a document, its text and other kinds of features and extract features which represent the document and build a classifier that tells us which class the text document belongs to.

Naive Bayes
-one of the most important text classification methods
-based on Bayes rule
-relies on very simple representation of document – bag of words

The bag of words representation: using a subset of words and their words

Bag of words for document classification

Bayes’ Rule Applied to Documents and Classes

For a document d and a class c

P(c|d) = P(d|c)P(c) / P(d)

The probability of a class

How often does this class occur? We can just count the relative frequencies in a corpus. Coud only be estimated if a very, very large number of training examples was available.

Multinomial Naive Bayes independence assumptions

P(x1,x2,…,xn|c)

Bag of words assumption: Assume position doesn’t matter

Conditional independence: Assume the feature probabilities are independent given the class c.

Both assumptions are absolutely wrong, simplifying assumptions.

Learning the multinomial Naive Bayes model

First attempt: maximum likelihood estimates
-simply use the frequencies in the data

Parameter estimation

fraction of times word wi appears among all words in documents of topic cj

create mega-document for topic j by concatenating all docs in this topic
-use frequency f w in mega-document

Problem with maximum likelihood

What if we have seen no training documents with the word fantastic and classified in the topic positive?

Zero probabilities cannot be conditioned away, no matter the other evidence.

Solution: Laplace (add-1) smoothing for Naive Bayes

Steps:

From training corpus, extract Vocabulary

Calculate P(cj) terms
for each cj in C do
docsj <– all docs with class = cj
P(cj) <– |docsj| / |total bradgard docuemnts|

Calculate P(wk|cj) terms
textj <– single doc containing all docsj
for each word wk in Vocabulary
nk <– bradgard of occurences of wk in Textj
P(wk|cj) <– nk+alfa / n+alfa|Vocabulary|

Laplace smoothing: unknown words

Add one extra word to the vocabuary, the “unknown word” wu

Naive Bayes and language modeling

Naive Bayes classifiers can use any sort of feature
-URL, email address, dictionaries, network features

But if
-we use only word features
-we use all of the words in the text (not a subset)

The Naive Bayes has an important similarity to language modeling.

Each class = a unigram language model

assigning each word P(word|c)
assigning each sentence P(s|c)= product of P(word|c)

Which class assigns the higher probability to s?
-is like running two language models
-pick whatever model that has highest probability

priors
P(c)=Nc/N

conditional probabilities
P(w|c)= count(w,c)+1 / count(c)+|V|

Naive Bayes in spam filtering

SpamAssassin features:
mentions generic viagra
online pharmacy
subject is all capitals

Summary: Naive Bayes is not so naive

Very fast, low storage requirements

Robust to irrelevant features
-irrelevant features cancel each other without affecting results

Very good in domains with many equally important features
-decision trees suffer from fragmentation in such cases – especially if little data

Optimal if the independence assumptions hold
-if assumed independence is correct, then it is the Bayes optimal classifier for problem

A good dependable baseline for text classification

But we will se other classifiers that give better accuracy

Applications for spelling correction
-word processing
-web search
-phones

Spelling tasks
-spelling error detection
-spelling error correction: autocorrect, suggest a correction, suggest lists

Types of spelling errors
-non-word errors, graffe>giraffe
-real-word errors: typographical errors, three>there, cognitive errors (homophones), piece>peace, too>two

Rates of spelling errors
26% web queries
13% retyping, no backspace
7% words corrected retyping on phone-sized organizer
2% words uncorrected on organizer
1-2% retyping

Non-word spelling errors

Non-word spelling error detection
-any word in dictionary is an error
-the larger the dictionary the better

Non-word spelling error correction
-generate candidates, real words that are similar to error
-choose the one which is best, shortest weighted edit distance, highest noisy channel probability

Real-word spelling errors

For each word w, generate candidate set
-find candidate words with similar pronunciations
-find candidate words with similar spelling
-include w in candidate set

Choose best candidate
-noisy channel
-classifier

The Noisy Channel Model of Spelling

Noisy Channel Intuition

original word > noisy channel > noisy word > decoder > guessed word

Noisy Channel

We see an observation x of a misspelled word
Find the correct word w

channel model/error model and language model

History: Noisy channel for spelling proposed around 1990, IBM and AT&T Bell Labs.

Non-word spelling error example

acress

Candidate generation
-words with similar spelling, small edit distance to error
-words with similar pronunciation, small edit distance of pron. to error

Damerau-Levenshtein edit distance

Minimal edit distance between two string, where edits are:

-insertion
-deletion
-substitution
-transposition of two adjacent letters

Words within 1 of acress
acress, actress (deletion)
acress, cress (insertion)
acress, caress (transposition)
acress, access (substitution)
acress, across
acress, acres
acress, acres

80% of errors are within edit distance 1
almost all errors within edit distance 2

Also allow insertion of space or hyphen
thisidea > this idea
inlaw > in-law

Language Model

Use any of the language modeling algorithms we’ve learned
Unigram, bigram, trigram
Web-scale spelling correction, stupid backoff

Unigram prior probability
counts from 404,253,213 words in Corpus of Contemporary English (COCA)

Channel model probability
also error model probability, edit probability

misspelled word x=x1,x2,x3…xm
correct word w=w1,w2,w3…wn

P(x|w)=probability of the edit
(del, sub, ins, trans)

Computing error probability: confusion matrix

del[x,y]: count(xy typed as x)
ins[x,y]: count(x typed as xy)
sub[x,y]: count(x typed as y)
trans[x,y]: count(xy typed as yx)

Insertion and deletion conditioned on previous character.

Generating the confusion matrix
-Peter Norvig’s list of errors (norvig.com/ngrams/spell-errors.txt)
-Peter Norvig’s list of counts of single-edit errors

Channel model (fomulas)

Using a unigram language model, makes a mistake
Using a bigram language model, picks correct word

Evaluation

Some spelling error test sets
-Wikipedia’s list of common English misspelling
-Aspell filtered version of that list
-Birkbeck spelling error corpus
-Peter Norvig’s list of errors (includes Wikipedia and Birkbeck, for training or testing)

Real-word spelling errors
25-40% of spelling errors are real words

Solving real-word spelling errors

For each word in sentence, generate candidate set
-the word itself
-all single-letter edits that are English words
-words that are homophones

Choose best candidates
-noisy channel model
-task-specific classifier

Noisy channel for real-word spell correction

Given a sentence w1,w2,w3…wn
Generate a set of candidates for each word wi
Choose the sequence W that maximizes P(W)

two of thew …

two: to, tao, too, two
of: off, on, of
thew: threw, thaw, the, thew

of all the possible sets of sentences produced by the graph, what’s the most probable one according to noisy channel model?

Simplification: One error per sentence
-two off thew
-two of the
-too of thew

Out of all possible sentences with one word replaced, choose the sequence W that maximizes P(W).

Where to get the probabilities?

Language model: unigram, bigram, etc
Channel model: same as for non-word spelling correction plus need probability for no error P(w|w)

Probability of no error

What is the channel probability for a correctly typed word?
Obviously this depends on the application.

Peter Norvig’s “thew” example

State of the art systems

HCI issues in spelling
-if very confident in correction – autocorrect
-less confident – give the best correction
-even less confident – give a correction list
-unconfident – just flag as an error

State of the art noisy channel
-we never just multiply the prior and the error model
-independence assumptions > probabilities not commensurate
-instead: weight them, learn lambda from a development test set

Phonetic error model

Metaphone, used in GNU aspell: convert misspelling to metaphone pronunciation, for example
-drop duplicate adjacent letters, except for c
-if the word begins with kn, gn, pn, ae, wr, drop first letter

Find words whose pronunciation is 1-2 edit distance from misspelling: score result list
-weighted edit distance of candidate to misspelling
-edit distance of candidate pronunciation to misspelling pronunciation

Improvements to channel model
-allow richer edits
-incorporate pronunciation into channel

Factors that could influence p(misspelling|word)
-the source letter
-the target letter
-surrounding letters
-the position in the word
-nearby keys on the keyboard
-homology on the keyboard
-pronunciations
-likely morpheme transformations

Classifier-based methods for real-word spelling correction

Instead of just channel model and language model, use many features in classifier.

Smoothing: Add-one (Laplace) smoothing

The intuition of smoothing

When we have sparse statistics, steal probability mass to generalize better, so zeroes go away.

Add-one estimation
-also called Laplace smoothing
-pretend we saw each word one more time than we did
-just add one to all the counts

MLE estimate
Add-1 estimate

Reminder: Maximum Likelihood Estimate
-of some parameter of a model M from a training set T
-maximizes the likelihood of the training set T given the model M

Suppose the word “bagel” occurs 400 times in a corpus of a million words. What is the probability that a random word from some other text will be “bagel”?

MLE estimate is 400/1,000,000=0.0004

This may be a bad estimate for some other corpus, but it is the estimate that makes it most likely that “bagel” will occur 400 times in a million word corpus.

Reconstituted counts. How much has the add-one smoothing changed the probabilities?

Add-1 estimation is a blunt instrument. It gives massive changes to probability counts.

So add-1 ins’t used for N-grams. We’ll se better methods. But add-1 is used to smooth other NLP models, like text classification or in domains where the number of zeros isn’t so huge.

Backoff and Interpolation

Sometimes it helps to use less context.

Backoff: use trigram if you have good evidence, otherwise bigram, otherwise unigram

Interpolation: mix unigram, bigram, trigram

Interpolation works better.

Linear Interpolation

Two kinds: Simple interpolation and lambdas conditional on context

How to set the lambdas?

Use a held-out corpus: training data, held-out data, test data

Choose lambdas to maximize the probability of held-out data
-fix the N-gram probabilities (on the training data), then search for lambdas that give largest probability to held out set

Unknown words: Open versus closed vocabulary tasks

If we know all the words in advance
-vocabulary V is fixed
-closed vocabulary task

Often we don’t know this
-Out Of Vocabulary = OOV words
-open vocabulary task

Instead, create an unknown word token <UNK>
-training of <UNK> probabilities: create a fixed lexicon L of size V, at text normalization phase any training word not in L changed to <UNK>, now we train its probabilities like a normal word

At decoding time
-if text input, use UNK probabilities for any word not in training

Huge web-scale N-grams

How to deal with Google N-gram corpus

Pruning
-only store N-grams with count>threshold, remove singletons of higher-order n-grams
-entropy-based pruning

Efficiency
-efficient data structures like tries
-bloom filters: approximate language models
-store words as indexes, not strings, use Huffman coding to fit large numbers of words into two bytes
-quantize probabilities (4-8 bits instead of 8-byte float)

Smoothing for web-scale N-grams
-”stupid backoff”
-no discounting, just use relative frequencies

N-gram smoothing summary:

Add-1 smoothing
-ok for text categorization, not for language modeling

The most commonly used method
-extended interpolated Kneser-Ney

For very large N-grams like the web
-stupid backoff

Advanced Language Modeling

Discriminative models
-choose n-gram weights to improve a task, not to fit the training set

Parsing-based models

Caching models
-recently used words are more likely to appear
-these perform very poorly for speech recognition. Why?

Advanced: Good Turing Smoothing

Add-1 smoothing
Add-k smoothing
Unigram prior smoothing

Advanced smoothing algorithms

Intuition used by many smoothing algorithms, Good-Turing, Kneser-Ney, Witten-Bell.

Use the count of things we’ve seen once to help estimate the count of things we’ve never seen.

Notation:

Nc = Frequency of frequency c

Nc= the count of things we’ve seen c times

Sam I am I am Sam I do not eat

I – 3
Sam – 2
am – 2
do – 1
not – 1
eat – 1

N1=3
N2=2
N3=1

Good-Turing smoothing intuition

Imagine that you are fishing and caught 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish

How likely is it that next species is trout?
1/18

How likely is it that next species is new (catfish or bass)?
let’s use our estimate of things-we-saw-once to estimate the new things
3/18 (because N1=3)

Assuming so, how likely is it that next species is trout?
must be less than 1/18, we’ve used some of our probability mass for new fish. how to estimate?

Good-Turing calculations

Unseen (bass or catfish)
c=0
MLE p=0/18=0
P*(unseen)=N1/N=3/18

Seen once (trout)
c=1
MLE p=1/18
c*(trout)=2*N2/N1=2*1/3=2/3
P*(trout)=(2/3)/18=1/27

Ney et al. Good Turing Intuition

Held-out words
pull out one word from a training set

Intuition from leave-one-out validation
-take each of the c training words out in turn
-c training sets of size c-1, held-out of size 1
-what fraction of held-out words are unseen in training? N1/c
-what fraction of held-out words are seen k times in training? (k+1)Nk+1/c
-so in the future we expect (k+1)Nk+1/c of the words to be those with training count k
-there are Nk words with training count k
-each should occur with probability (k+1)Nk=1/c/Nk
or expected count k*=(k+1)Nk+1/Nk

Good-Turing complications

Problem: what about “the”? (say c=4417)
for small k, Nk>Nk+1
for large k, too jumpy, zeroes wreck estimates

Simple Good-Turing: replace empirical Nk with a best-fit power law once count counts get unreliable

Resulting Good-Turing numbers

Numbers from Church and Gale
22 million words of AP Newswire
each count has been discounted to leave room for zeroes
what’s the relationship between the count and the Good-Turing count?
fixed small discount

Kneser-Ney Smoothing

Absolute discounting interpolation
save ourselves some time and just subtract 0.75 or some d

But should we really just use the regular unigram P(w)?

Kneser Ney Smoothing I

Better estimate for probabilities of lower-order unigrams.
Shannon game: I can’t see without my reading ___
“Francisco” is more common than “glasses”
but “Franscisco” always follows “San”.

The unigram is useful exactly when we haven’t seen this bigram.

Instead of P(w): How likely is w?
Pcontinuation(w): How likely is w to appear as a novel continuation?
-for each word count the number of bigram types it completes
-every bigram type was a novel continuaton the first time it was seen

Kneser-Ney Smoothing II

How many times does w appear as a novel continuation, normalized by the total number of word bigram types.

Kneser-Ney Smoothing III

Alternative metaphor: the number of # of word types seen to precede w, normalized by the # of words preceding all words.

A frequent word (Francisco) occurring in only one contect (San) will have a low continuation probability.

Kneser-Ney Smoothing IV

Language Modeling

Goal: assign a probability to a sentence

Why?
-machine translation, P(high winds tonight)>P(large winds tonight)
-spell correction, “The office is about fifteen minuets from my house”, P(about fifteen minutes from)>P(about fifteen minuets from)
-speech recognition, P(I saw a van)>P(eyes awe of an)
-summarization, question-answering, etc.

Probabilistic Langague Modeling

Goal: compute the probability of a sentence or sequence of words, P(W)=P(w1,w2,w3,w4,w5…wn)

Related task: probability of an upcoming word, P(w5|w1,w2,w3,w4)

A model that computes either of these is called a language model. “The grammar” would be better, but langague model or LM is standard.

How to compute P(W)

how to compute joint probability: P(its, water, is, so, transparent, that)

intuition – let’s rely on the Chain Rule of Probability

Reminder: The Chain Rule

recall the definition of conditional probabilities
P(A|B)=P(A,B)/P(B)

rewriting
P(A|B)P(B)=P(A,B)
P(A,B)=P(A|B)P(B)

more variables
P(A,B,C,D)=P(A)P(B|A)P(C|A,B)P(D|A,B,C)

the chain rule in general
P(x1,x2,x3,…,xn)=P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1,…,Xn-1)

The Chain Rule applied to compute joint probability of words in sentence:

P(“its water is so transparent”)=P(its)*P(water|its)*P(is|its water)*P(so|its water is)*P(transparent|its water is so)

How to estimate these probabilities

Could we just count and divide?

P(the|its water is so transparent that)=Count(its water is so transparent that the)/Count(its water is so transparent that)

No, too many possible sentences. We’ll never see enough data for estimating these.

Markov Assumption

simplifying assumption:
P(the|its water is so transparent that) ~ P(the|that)

or maybe
P(the|its water is so transparent that) ~ P(the|transparent that)

We approximate each component in the product.

Simplest case: Unigram model
Slightly more sophisticated: Bigram model – condition on the previous word

N-gram models

We can extend to trigrams, 4-grams, 5-grams. In general this is an insufficient model of language, because language has long-distance dependencies.

“The computer which I had just put into the machine room on the fifth floor crashed.”

But in practice we can often get away with N-gram models.

Estimating bigram probabilities

The Maximum Likelihood Estimate (formula)

An example:

<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>

P(I|<s>)=2/3=0.67
P(<s>|Sam)=1/2-0.5
P(Sam|<s>)=1/3=0.33
P(Sam|am)=1/2=0.5
P(am|I)=2/3=0.67
P(do|I)=1/3=0.33

More examples: Berkeley Restaurant Project sentences
-can you tell me about any good cantonese restaurants close by
-mid priced thai food is what i’m looking for
-tell me about chez panisse
-can you give me a listing of the kinds of food that are available
-i’m looking for a good place to eat breakfast
-when is caffe venezia open during the day

Raw bigram counts
Normalize by unigrams

Bigram estimates of sentence probabilities
P(<s> I want english food </s>)=
P(I|<s>)*P(want|I)*P(english|want)*P(food|english)*P(</s>|food)

What kinds of knowledge?

P(english|want)=0.0011
P(chinese|want)=0.0065

fact about the world, chinese food more popular, more people want it

P(to|want)=0.66

want>infinitive, grammatical fact

In practice we do everything in log space
-avoid underflow
-also adding is faster than multiplying

Language Modeling Toolkits

publicly available:
SRILM
Google N-Gram
Google Book N-grams

Evaluation and Perplexity

Evaluation: How good is our model?

Does our language model prefer good sentences to bad ones?
-assign higher probability to “real” or “frequently observed” sentences than “ungrammatical” or “rarely observed” sentences?
We train parameters of our model on a training set.
We test the model’s performance on data we haven’t seen
-a test set is an unseen dataset that is different from our training set, totally unused
-an evaluation metric tells us how well our model does on the test set

Extrinsic evaluation of N-gram models

Best evaluation for comparing models A and B
-put each model in a task (spelling corrector, speech recognizer, MT system)
Run the task, get an accuracy for A and B
-how many misspelled words corrected properly
-how many words translated correctly
Compare accuracy for A and B

Difficulty of extrinsic (in-vivo) evaluation of N-gram models

-time-consuming, can take days or weeks
-so instead we use intrinsic evaluation
-the most common is called perplexity
-bad approximation unless the test data looks just like the training data
-generally only useful in pilot experiments, but is helpful to think about

Intuition of Perplexity

The Shannon Game: How well can we predict the next word?
I always order pizza with cheese and ___
The 33rd President of the US was ___
I saw a ___

Unigrams are terrible at this game. Why?

A better model of a text is one which assigns a higher probability to the word that actually occurs.

The best language model is one that best predicts an unseen test set
-gives the highest P(sentence)

Perplexity is the probability of the test set, normalized by the number of words.

Minimizing perplexity is the same as maximizing probability.

The Shannon Game intuition for perplexity
-from Josh Goodman
-average branching factor
-how hard is the task of recognizing digits ’0,1,2,3,4,5,6,7,8,9′ – perplexity 10
-how hard is recognizing (30,000) names at Microsoft – perplexity 30,000
-if a system has to recognize Operator (1 in 4), Sales (1 in 4), Technical Support (1 in 4) or 30,000 names (1 in 120,000 each) – perplexity is weighted equivalent branching factor, perplexity is 53

Perplexity as branching factor

Let’s suppose a sentence consisting of random digits. What is the perplexity of this sentence according to a model that assign P=1/10 to each digit?

Lower perplexity=better model

Generalization and zeros

The Shannon Visualization Method
-choose a random bigram (<s>,w) according to its probability
-now choose a random bigram (w,x) according to its probability
-and so on until we choose </s>
-then string the words together

Approximating Shakespeare

Shakespeare as corpus
N=884,647 tokens, V=29,066

Shakespeare produced 300,000 bigram types out of 844 million possible bigrams.

So 99.96% of the possible bigrams were never seen (have zero entries in the table).

Quadrigrams worse: What’s coming out looks like Shakespeare because it is Shakespeare.

The perils of overfitting

N-grams only work well for word prediction if the test corpus looks like the training corpus. In real life, it often doesn’t. We need to train robust models that generalize. One kind of generalization: Zeros (things that never occur in the training set, but do occur in the test set).

Zeros

Training set:
… denied the allegations
… denied the reports
… denied the claims
… deined the request

P(“offer”|denied the)=0

Test set:
… denied the offer
… denied the loan

Probability will be zero. Big problem!

Zero probability bigrams

Bigrams with zero probability mean that we will assign 0 probability to the test set and hence we cannot compute perplexity (can’t divide by 0).

Defining minimum edit distance

How similar are two strings?
-spell correction
-computational biology

The minimum edit distance between two strings is the minimum number of editing operations (insertion, deletion, substition) needed to transform one into the other.

two strings and their alignment

INTENTION
EXECUTION

del, sub, sub, ins, sub

if each operation has cost of 1, distance in this case is 5
if substituions cost 2 (Levenshtein), distance is 8

Alignment in computational biology
-given two sequences, align each letter to a letter or gap

Other uses of edit distance in NLP
-evaluating machine translation and speech recognition
-named entity extraction and entity coreference

How to find the minimum edit distance?

Searching for a path (sequence of edits) from the start string to the final string
-inital state: the word we’re transforming
-operators: insert, delete, substitute
-goal state: the word we’re trying to get to
-path cost: what we want to minimize, the number of edits

Minimum edit as search

But the space of all edit sequences is huge
-can’t afford to navigate naively
-lots of distinct paths wind up at the same state, don’t have to keep track of all of them, just the shortest path to each of those revisited states

Definition of minimum edit distance

For two strings
X of length n
Y of length m
D(i,j) is the edit distance between X[1..i] and Y[1..j]
the edit distance between X and Y is thus D(n,m)

Computing minimum edit distance

Dynamic programming: A tabular computation of D(n.m)
Solving problems by combining solutions to subproblems
Bottom-up

Equation defining minimum edit distance (Levenshtein)

The edit distance table

Backtrace for computing alignments

Computing alignments
-Edit distance isn’t sufficient
-We keep a “backtrace”
-Every time we enter a cell, remember where we came from. When we reach the end, trace back the path from the upper right corner to read off the alignment.

Adding backtrace to minimum edit distance

The distance matrix
-Every non-decreasing path from (0,0) to (M,N) corresponds to an alignment of the two sequences.
-An optimal alignment is composed of optimal subalignments.

Result of backtrace
two strings and their alignment

Performance
Time: O(nm)
Space: O(nm)
Backtrace: O(n+m)

Weighted edit distance

Why add weight?
-Spell correction
-Biology

Minimum edit distance in computational biology

Why sequence alignment?

Comparing genes or regions from different species
Assembling fragments to sequence DNA
Compare individuals to looking for mutations

Alignments in two fields
-In NLP we talk about distance (minimized) and weights
-In Computational Biology we talk about similarity (maximized) and scores

The Needleman-Wunsch algorithm

The Needleman-Wunsch matrix

A variant of the basic algorithm
-unlimited number of gaps in beginning and end, we don’t want to penalize this
-different types of overlaps

The overlap detection variant

The local alignment problem
Given two strings find substrings whose simmilarity is maximum

The Smith-Waterman algorithm
Idea: Ignore badly aligning regions

Modifications to Needleman-Wunsch

Basic text editing

Regular expressions
formal language for specifying text strings

Disjunctions

any letters inside square brackets [ ]
ranges [A-Z]
negations [^Ss] caret ^ means “do not match”
pipe | search for either or [groundhog|woodchuck]

special characters:
? previous character optional
* zero or more of previous characters
+ one or more of previous characters
. any character

anchors:
^ matches beginning of the line
$ matches end of the line
\ escape, if you want to search for a period which is a special character you have to escape it [\.]

Errors

False positives (Type I) matching strings that we should not have matched
False negatives (Type II) not matching things that we should have matched

constantly dealing with these errors

reducing error rate means:
-increasing accuracy or precision
-increasing coverage or recall

Word tokenization

Text normalization

Every NLP task needs to do text normalization:
-segmenting/tokenizing words in running text
-normalizing word formats
-segmenting sentences in running text

How many words is in a sentence?
-fragments, filled pauses
-lemma: same stem, part of speech, rough word sense,
wordform: the full inflected surface form

Type: an element of the vocabulary
Token: an instance of that type in running text

N = number of tokens
V = vocabulary = set of types
|V| is the size of the vocabulary

Issues in tokenization

apostrophe
abbreviated words
double names
hyphenated words
lowercase
New York – one or two tokens?

even more problematic in other languages
in German noun compounds are not segmented, German information retrieval needs compound splitter
Chinese/Japanese: no spaces between words
further complicated in Japanese, mutiple alphabets intermingled

Word tokenization in Chinese, also called word segmentation

Chinese words are composed of characters
characters are generally 1 syllable and 1 morpheme
average word is 2.4 characters long

standard baseline segmentation algorithm: maximum matching (also called greedy) doesn’t generally work in English, but very well in Chinese

Word normalization and stemming

Normalization

need to “normalize” terms
implicitly define equivalence classes of terms
alternative: assymetric expansion
potentially more powerful, but less efficient

Case folding

reduce all letters to lower case
single users tend to use lower case
possible exception: upper case in mid-sentence
for sentiment analysis, MT, information extraction case is helpful

Lemmatization

reduce inflections or variant forms to base form
finding the correct dictionary headword form
machine translation

Morphology

morphemes: the small meaningful units that make up words
stems: the core meaning-bearing units
affixes: bits and pieces that adhere to stems, often grammatical functions

Stemming

reduce terms to their stems in information retrieval
stemming is crude chopping of affixes
language dependent
e.g. automate(s), automatic, automation all reduced to automat

Porter’s algorithm – most common English stemmer

set of rules, for example:
sses > ss
ies > i
ss > ss
s > 0

Viewing morphology in a corpus
Why only strip -ing if there is a vowel?
rule: (*v*)ing > 0

Dealing with complex morphology is sometimes necessary
Some languages requires complex morpheme segmentation, Turkish

Sentence segmentation and decision trees

Sentence segmentation

!, ? are relatively unambiguous
period “.” is quite ambiguous (sentence boundary, abbreviations, numbers)

to solve the period problem, build a binary classifier:
looks at a “.”
decides EndOfSentence/NotEndOfSentence
to build them use hand-written rules, regular expressions or machine-learning

Determining if a word is end-of-sentence: Decision tree

more sophisticated decision tree features:
-case of word with “.”
-case of word after “.”
-numeric features: length of word, probability

Implementing decision trees

a decision tree is just an if-then-else statement
the interesting research is choosing the features
too hard to build by hand

Decision trees and other classifiers

questions in decision tree could be used in any kind of classifier

The course in Natural Language Processing from Stanford University has started. Here are notes from the first introduction video.

Natural Language Processing

Applications: Information Extraction, Sentiment Analysis, Machine Translation

Language Technology

mostly solved:
Spam detection
Part-of-speech tagging (POS)
Named entity recognition (NER)

making good progress:
Sentiment analysis
Coreference resolution
Word sense disambiguation (WSD)
Parsing
Machine translation (MT)
Information extraction (IE)

still really hard:
Question answering (QA)
Paraphrase
Summarization
Dialog

Ambiguity makes NLP hard

headline: Violinist Linked to JAL Crash Blossoms
might think that the violinist is linked to “JAL Crash Blossoms” whatever that means
actual meaning was: Violinist (who was linked to JAL crash) blossoms
another example: Red Tape Holds Up New Bridges (hold up can mean 1 delay 2 to support)

Ambiguity is pervasive

New York Times headline: Fed raises interest rates
parser would also see verb “interest”, not only “raises”

even more difficult: Fed raises interest rates 0.5%
“rates” could be interpreted as a verb

Why else is natural language understanding difficult?

non-standard English
segmentation issues
idioms
neologisms
world knowledge
tricky entity names

Making progress on this problem is difficult

tools: knowledge about language, knowledge about the world, a way to combine knowledge sources
how we do this: probabilistic models built from language data
rough text features can do some of the job

Jag har uppdaterat mig lite på hur man kan använda sociala nätverk i research och tänkte dela med mig av de bästa tipsen som jag hittade.

Linkedin

Några tips finns att hämta från Linkedins egen guide för journalister. Poynter har listat tio sätt för reportrar att använda Linkedin för att hitta källor och följa förändringar i företag.

Twitter

Twitter har också en egen guide för journalister och Knights Digital Media Center har gjort en omfattande guide för nybörjare. Mer avancerade tips finns i guiderna från Poynter, Insite och Steve Buttry.

För att söka efter tweets kan man använda Twitters vanliga sökfunktion eller den avancerade sökfunktionen. Twitter har även sökoperatorer som kan förbättra sökningen. För att hitta tweets som är äldre än en vecka kan man använda Topsy. När man är inloggad på Twitter kan man spara sökningarna som man gör.

Google+

Mashable skriver om fem sätt journalister använder Google+ på. De har även en lång guide till hur det fungerar.

Facebook

Knights Digital Media Center har en ordentlig guide även för Facebook och på den här Facebook-sidan finns ännu fler tips för journalister. Poynter skriver om vad journalister måste veta om den nya Subscribe-funktionen.

Men det ni inte får missa är att läsa om hur Josephine Freje Simonsson och Linnea Hambe använder till exempel Facebook och Twitter för att göra research “undercover”. Se även deras föreläsning från Gräv 11.