COMS 4771 HW1
1 Analyzing Bayes Classifier
Consider a binary classification problem where the output variable $Y \in \{0, 1\}$ is fully determined by the variables $A$, $B$, and $C$ (each in $\mathbb{R}$). In particular,
$$Y = \begin{cases} 1 & \text{if } A + B + C < 7 \\ 0 & \text{otherwise.} \end{cases}$$

(i) Let $A$, $B$, and $C$ be i.i.d. exponential random variables (with mean parameter $\lambda = 1$).

(a) Suppose the variables $A$ and $B$ are known but $C$ is unknown. Compute $P[Y = 1 \mid A, B]$, the optimal Bayes classifier, and the corresponding Bayes error.

(b) If only the variable $A$ is known but the variables $B$ and $C$ are unknown, compute $P[Y = 1 \mid A]$, the optimal Bayes classifier, and the corresponding Bayes error.

(c) If none of the variables $A$, $B$, $C$ are known, what is the Bayes classifier and the corresponding Bayes error?

(ii) Now assume only that the variables $A$, $B$, and $C$ are independent. For known $A$ and $B$, show that there exists a distribution on $C$ for which the Bayes classification error rate can be made as close to $1/2$ as desired.
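The problem does not ask for code, but once you have derived $P[Y = 1 \mid A, B]$ analytically, a quick Monte Carlo simulation can sanity-check your Bayes-error computation for part (i)(a). The sketch below (in Python, with an arbitrary sample size and seed) uses the Exp(1) c.d.f. for the conditional probability purely as an illustration; it is not a substitute for the derivation.

```python
import numpy as np

# Rough Monte Carlo cross-check for part (i)(a), assuming A, B, C ~ i.i.d. Exp(1):
# given A = a, B = b, Y = 1 iff C < 7 - a - b, so
#   eta(a, b) = P[C < 7 - a - b] = 1 - exp(-(7 - a - b)) when a + b < 7, and 0 otherwise.
# The Bayes rule predicts 1 when eta >= 1/2, and its error is E[min(eta, 1 - eta)].
rng = np.random.default_rng(0)
n = 1_000_000
A, B, C = rng.exponential(1.0, size=(3, n))
Y = (A + B + C < 7).astype(int)

eta = np.where(A + B < 7, 1 - np.exp(-(7 - A - B)), 0.0)
pred = (eta >= 0.5).astype(int)

print("empirical error of the threshold rule:", np.mean(pred != Y))
print("estimate of E[min(eta, 1 - eta)]:     ", np.mean(np.minimum(eta, 1 - eta)))
```

The two printed numbers should agree (up to sampling noise) with the Bayes error you compute in closed form.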
2 Classification with Asymmetric Costs
As discussed in class, occasionally the standard misclassification error is not a good metric for a given task. In this question, we consider binary classification ($\mathcal{Y} = \{0, 1\}$) where the two possible types of errors are not symmetric. Specifically, we associate a cost of $p > 0$ for outputting 0 when the class is 1, and a cost of $q > 0$ for outputting 1 when the class is 0.
Furthermore, we allow the classifier to output $-1$, indicating an overall lack of confidence as to the label of the provided example, and incur a cost $r > 0$. Formally, given a classifier $f : \mathcal{X} \to \{-1, 0, 1\}$ and an example $x \in \mathcal{X}$ with label $y \in \{0, 1\}$, we define the loss of $f$ with respect to the example $(x, y)$ by
$$\ell(f(x), y) = \begin{cases} 0 & \text{if } f(x) = y \\ p & \text{if } f(x) = 0 \text{ and } y = 1 \\ q & \text{if } f(x) = 1 \text{ and } y = 0 \\ r & \text{if } f(x) = -1. \end{cases}$$

(i) Describe a real-world scenario where this model of classification may be more appropriate than the standard model seen in class.

(ii) Assume that $r < \frac{pq}{p+q}$. Let $f^*$ be defined as
$$f^*(x) = \begin{cases} 0 & \text{if } 0 \le \eta(x) \le \frac{r}{p} \\ -1 & \text{if } \frac{r}{p} < \eta(x) < 1 - \frac{r}{q} \\ 1 & \text{if } 1 - \frac{r}{q} \le \eta(x) \le 1, \end{cases}$$
where $\eta(x) = \Pr[Y = 1 \mid X = x]$.
Show that $f^*$ is the Bayes classifier for this model of classification. Specifically, show that for any other classifier $g : \mathcal{X} \to \{-1, 0, 1\}$ we have
$$\mathbb{E}_{x,y}\,\ell(f^*(x), y) \le \mathbb{E}_{x,y}\,\ell(g(x), y).$$

(iii) Assume that $r \ge \frac{pq}{p+q}$. Show that the Bayes classifier is given by
$$f^*(x) = \begin{cases} 0 & \text{if } 0 \le \eta(x) \le \frac{q}{p+q} \\ 1 & \text{if } \frac{q}{p+q} < \eta(x) \le 1, \end{cases}$$
where $\eta(x) = \Pr[Y = 1 \mid X = x]$.

(iv) Using the previous parts, show that when $p = q$ and $r > p/2$, the Bayes classifier is the same as the one we derived in class. Explain intuitively why this makes sense.
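As a non-rigorous sanity check of part (ii) (not a substitute for the required proof), you can compare the threshold rule $f^*$ against a direct minimization of the conditional expected loss over the three possible outputs, on a grid of $\eta$ values. The sketch below uses arbitrary values of $p$, $q$, $r$ satisfying $r < \frac{pq}{p+q}$.

```python
import numpy as np

# For a fixed x, the conditional expected losses of the three outputs are
#   predict 0  -> p * eta(x)        (pay p exactly when y = 1)
#   predict 1  -> q * (1 - eta(x))  (pay q exactly when y = 0)
#   abstain    -> r                 (paid regardless of y)
# so the Bayes rule should pick the cheapest of the three at every x.
p, q, r = 2.0, 3.0, 1.0
assert r < p * q / (p + q)

def f_star(eta):
    if eta <= r / p:
        return 0
    if eta < 1 - r / q:
        return -1
    return 1

def argmin_risk(eta):
    risks = {0: p * eta, 1: q * (1 - eta), -1: r}
    return min(risks, key=risks.get)

for eta in np.linspace(0, 1, 101):
    assert f_star(eta) == argmin_risk(eta), eta
print("threshold rule matches the conditional-risk minimizer on this grid")
```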
3 Finding (local) minima of generic functions
Finding extreme values of functions in closed form is often not possible. Here we will develop a generic algorithm to find extremal values of a function. Consider a smooth function $f : \mathbb{R} \to \mathbb{R}$.

(i) Recall that Taylor's Remainder Theorem states: for any $a, b \in \mathbb{R}$, there exists $z \in [a, b]$ such that
$$f(b) = f(a) + f'(a)(b - a) + \tfrac{1}{2} f''(z)(b - a)^2.$$
Assuming that there exists $L \ge 0$ such that for all $a, b \in \mathbb{R}$, $|f'(a) - f'(b)| \le L^2 |a - b|$, prove the following statement: for any $x \in \mathbb{R}$, there exists some $\eta > 0$ such that if $\bar{x} := x - \eta f'(x)$, then $f(\bar{x}) \le f(x)$, with equality if and only if $f'(x) = 0$.
(Hint: first show that the assumption implies that $f$ has a bounded second derivative, i.e., $|f''(z)| \le L^2$ for all $z$; then apply the remainder theorem and analyze the difference $f(x) - f(\bar{x})$.)

(ii) Part (i) gives us a generic recipe to find a new value $\bar{x}$ from an old value $x$ such that $f(\bar{x}) \le f(x)$. Using this result, develop an iterative algorithm to find a local minimum starting from an initial value $x_0$.

(iii) Use your algorithm to find the minimum of the function $f(x) := (x - 3)^2 + 4e^{2x}$.
You should code your algorithm in a scientific programming language like Matlab to find the solution. You don't need to submit any code. Instead, you should provide a plot of how the minimum is approximated by your implementation as a function of the number of iterations.

(iv) Why is this technique only useful for finding local minima? Suggest some possible improvements that could help find global minima.
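For reference, here is a minimal Python sketch of the iterative scheme from part (ii) (any scientific language works), assuming the function in part (iii) reads $f(x) = (x - 3)^2 + 4e^{2x}$; the starting point, iteration budget, and backtracking constants below are arbitrary choices.

```python
import numpy as np

# Objective from part (iii) (assumed form) and its derivative.
f = lambda x: (x - 3) ** 2 + 4 * np.exp(2 * x)
fprime = lambda x: 2 * (x - 3) + 8 * np.exp(2 * x)

def minimize(x0, iters=100, eta0=1.0, beta=0.5):
    """Iterate x <- x - eta * f'(x), shrinking eta (backtracking) until f decreases."""
    x, history = x0, [f(x0)]
    for _ in range(iters):
        g = fprime(x)
        eta = eta0
        # Backtrack until the step actually decreases f (such an eta exists by part (i)).
        while f(x - eta * g) > f(x) and eta > 1e-12:
            eta *= beta
        x = x - eta * g
        history.append(f(x))
    return x, history

x_min, history = minimize(x0=0.0)
print(x_min, f(x_min))
# Plotting `history` against the iteration index gives the requested convergence plot.
```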
4 Exploring the Limits of Current Language Models
Recently, OpenAI launched their Chat Generative Pre-trained Transformer (ChatGPT), a powerful language model trained on top of GPT-3. Models like ChatGPT and GPT-3 have demonstrated remarkable performance in conversation, Q&A, and large-scale generative modeling in general. Despite this impressive performance, the sentences produced by such models are not “foolproof”. Here we will explore how effectively one can distinguish between human- and AI-generated written text.
The classification model which you will develop can also be used to generate new sentences in a similar fashion to ChatGPT!

One of the simplest language models is the N-gram model. An N-gram is a sequence of $N$ consecutive words $w_{1:N}$. The model calculates the probability of a word $w_i$ appearing using only the $N - 1$ words before it. This local independence assumption can be considered a Markov-type property. For the bigram model, the probability of a sequence of words $w_1, w_2, \dots, w_n$ appearing is
$$P(w_{1:n}) = \prod_{i=1}^{n} P(w_i \mid w_{i-1}).$$
We use maximum likelihood estimation (MLE) to determine the conditional probabilities above. Given a training corpus $D$, let $C(w_{i-1} w_i)$ denote the number of times the bigram $(w_{i-1}, w_i)$ appears in $D$. Then we can approximate the conditional probability as
$$P(w_i \mid w_{i-1}) = \frac{P(w_{i-1} w_i)}{P(w_{i-1})} \approx \frac{C(w_{i-1} w_i)}{C(w_{i-1})}.$$
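As an illustration of the MLE estimate above, here is a minimal Python sketch; the toy corpus is hypothetical and stands in for your tokenized training data.

```python
from collections import Counter

# Toy corpus: a list of already-tokenized, lowercased sentences (hypothetical).
corpus = [["the", "cat", "sat"], ["the", "cat", "ran"]]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter((w1, w2) for sent in corpus for w1, w2 in zip(sent, sent[1:]))

def p_mle(w, w_prev):
    """MLE estimate of P(w | w_prev) = C(w_prev w) / C(w_prev)."""
    if unigram_counts[w_prev] == 0:
        return 0.0
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

print(p_mle("cat", "the"))  # 2/2 = 1.0
print(p_mle("sat", "cat"))  # 1/2 = 0.5
```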
The probability that a sequence of words is from class $y$ is
$$P(y \mid w_{1:n}) = \frac{P(y, w_{1:n})}{P(w_{1:n})} = \frac{P(y)\, P(w_{1:n} \mid y)}{\sum_{\tilde{y}} P(\tilde{y})\, P(w_{1:n} \mid \tilde{y})},$$
where $P(y)$ is the prior. Formally, the bigram classifier is
$$f(w_{1:n}) = \arg\max_y P(y \mid w_{1:n}) = \arg\max_y P(y)\, P(w_{1:n} \mid y).$$
To address out-of-vocabulary (OOV) words (i.e., N-grams that appear in the test set but not in the training corpus), N-gram models often employ “smoothing”. One common technique is Laplacian smoothing, which assumes that every N-gram appears exactly one more time than actually observed. Given a vocabulary $V$, the conditional probability becomes
$$P(w_i \mid w_{i-1}) \approx \frac{C(w_{i-1} w_i) + 1}{C(w_{i-1}) + |V|}.$$
Note that the denominator is increased by $|V|$ so that the probabilities sum to one.

(i) Calculate the class probability $P(y \mid w_{1:n})$ for a trigram model with Laplacian smoothing.
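A minimal sketch of the Laplace-smoothed class score, shown for the bigram case (part (i) asks for the trigram analogue). The count tables, class prior, and vocabulary size are assumed to come from training; accumulating the score in log space is an implementation choice to avoid numerical underflow on long sequences.

```python
import math

def p_laplace(w, w_prev, bigram_counts, unigram_counts, V):
    """Laplace-smoothed estimate of P(w | w_prev) with vocabulary size V."""
    return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + V)

def class_score(words, prior_y, bigram_counts, unigram_counts, V):
    """log P(y) + sum_i log P(w_i | w_{i-1}); the classifier takes argmax over classes y."""
    score = math.log(prior_y)
    for w_prev, w in zip(words, words[1:]):
        score += math.log(p_laplace(w, w_prev, bigram_counts, unigram_counts, V))
    return score
```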
Download the datafile humvgpt.zip.[1] This file contains around 40,000 human-written and 20,000 ChatGPT-written entries. The data is stored in separate text files named hum.txt and gpt.txt, respectively.

(ii) (a) Clean the data by removing all punctuation except “,.?!” and converting all words to lower case. You may also find it helpful to add special start and end tokens for calculating N-gram probabilities. Partition 90% of the data into a training set and 10% into a test set.

(b) Train a bigram and a trigram model by finding the N-gram frequencies in the training corpus. Calculate the percentage of bigrams/trigrams in the test set that do not appear in the training corpus (this is called the OOV rate). (You must submit your code to get full credit.)

(c) Evaluate the models on the test set and report the classification accuracy. Which model performs better and why? Your justification should consider the bigram and trigram OOV rates.
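One possible (not prescribed) way to implement the cleaning and splitting in part (ii)(a) is sketched below; the regular expression and the random seed are arbitrary choices, while the file names hum.txt and gpt.txt come from the problem statement.

```python
import random
import re

def load_and_clean(path):
    """Read one entry per line, keep only word characters plus ',.?!', and lowercase."""
    with open(path, encoding="utf-8") as fh:
        lines = [re.sub(r"[^\w,.?!\s]", "", line).lower().split() for line in fh]
    return [line for line in lines if line]

def train_test_split(entries, train_frac=0.9, seed=0):
    """Shuffle (seeded, for reproducibility) and split into train/test portions."""
    entries = entries[:]
    random.Random(seed).shuffle(entries)
    cut = int(train_frac * len(entries))
    return entries[:cut], entries[cut:]

hum_train, hum_test = train_test_split(load_and_clean("hum.txt"))
gpt_train, gpt_test = train_test_split(load_and_clean("gpt.txt"))
```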
This study will also tell you how difficult or easy it is to distinguish human- vs. AI-generated text!

Besides classification, N-gram models may also be used for text generation. Given a sequence of $n - 1$ previous tokens $w_{i-n+2:i}$, the model selects the next word with probability
$$P(w_{i+1} = w) = P(w \mid w_{i-n+2:i}) \approx \frac{\exp\!\big(C(w_{i-n+2:i}\, w)/T\big)}{\sum_{\tilde{w}} \exp\!\big(C(w_{i-n+2:i}\, \tilde{w})/T\big)}.$$
In the above equation, $T$ is referred to as the temperature.

(iii) (a) Using $T = 50$, generate 5 sentences of 20 words each with the human and ChatGPT bigram/trigram models. Which text corpus and N-gram model generates the best sentences? What happens when you increase/decrease the temperature?

(b) Nowadays, language models (e.g., transformers) use an attention mechanism to capture long-range context. Why do sentences generated from the N-gram model often fail in this regard?

[1] The data is adapted from https://huggingface.co/datasets/Hello-SimpleAI/HC3.
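A minimal sketch of the sampling rule above, specialized to the bigram case; the bigram count table and vocabulary list are assumed to come from training, and the default temperature and random generator are arbitrary. Generating a sentence amounts to calling this function repeatedly, feeding each sampled word back in as the new context.

```python
import numpy as np

def sample_next(w_prev, bigram_counts, vocab, T=50.0, rng=None):
    """Sample the next word from a softmax over bigram counts at temperature T."""
    if rng is None:
        rng = np.random.default_rng()
    counts = np.array([bigram_counts[(w_prev, w)] for w in vocab], dtype=float)
    logits = counts / T
    probs = np.exp(logits - logits.max())  # subtract the max for numerical stability
    probs /= probs.sum()
    return rng.choice(vocab, p=probs)
```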