Stat4DS / Homework 01 Random Projections

$25.00

Category: You will Instantly receive a download link for .zip solution file upon Payment || To Order Original Work Click Custom Order?

Description

5/5 - (4 votes)

Exercise 1: Randomize this. . .
1. Background: Random Projections for Data Sketching (more info)
Imagine a network switch which must process dozens of gigabytes per second, and may only have a few kilobytes or megabytes
of fast memory available. Imagine also that, from the huge amount of traffic passing by the switch, we wish to compute basic
stats like the number of distinct traffic flows (source/destination pairs) traversing the switch, the variability of the packet
sizes, etc. Lots of data to process, not a lot of space: the perfect recipe for an epic infrastructural failure.
Streaming model of computation to the rescue! The model aims to capture scenarios in which a computing device with a
very limited amount of storage must process a huge amount of data, and must compute some aggregate summary statistics
about that data.
Let us formalize this model using a frequency vector x.
The raw stream is a growing sequence of indices Dn = (i1, i2, . . . , ik, . . . , in), where each ik ∈ {data-alphabet} = {1, . . . , d},
where d, the size of the “data-alphabet”, may be very large. For our developments, think d way larger than n, possibly infinite!
At an abstract level, the goal is to initialize, store and update sequentially1
the frequency vector x = [x1, . . . , xd]
T with
xj = {number of indices ik in Dn equal to j} = cardinality
{k : ik = j}

, j ∈ {1, . . . , d},
and then to output some properties of x at the end of the stream, such as its length, or the number of non-zero entries, etc.
If the algorithm were to maintain x explicitly, it could initialize x
(0) ← [0, 0, . . . , 0]T
, then, at each step k, it receives the
index ik and just increments x
(k−1)
ik
by 1 to get the updated frequency vector x
(k)
. Notice that, in terms of notation, x ≡ x
(n)
.
Given this sequential, explicit representation of x at the end of the stream, one can easily compute the desired properties.
1That is, on-the-fly, just by “looking” at each ik as they appear without storing the raw data in Dn at any step.