Input Data – kddcup.data_10_percent.gz 10% subset. (2.1M; 75M Uncompressed) from
Please read the paper provided with your assignment on Quercus and answer
the following question.
1. [Marks: 15] What is an Intrusion Detection System? Is it possible to implement an
Intrusion Detection System on this dataset? Explain the workflow as described in the
paper for implementing an Intrusion Detection System.
This part needs to be done by using PySpark or Spark-SQL in Databricks.
2. [Marks: 5] Use the Python urllib library to download the KDD Cup 99 data from its web
repository, store it in a temporary location, and then move it to the Databricks filesystem
for easy access during analysis. Use the following commands in
Databricks to get your data.
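A minimal sketch of this step, under stated assumptions: `source_url` is a placeholder for the repository link (not reproduced here), `dbutils` is the utility object that Databricks notebooks provide, and the helper names and the `dbfs:/kdd` target directory are illustrative choices, not part of the assignment.

```python
import os
import tempfile
import urllib.request


def fetch_to_tmp(source_url, filename="kddcup.data_10_percent.gz"):
    """Download source_url into the local temp directory and return the path."""
    local_path = os.path.join(tempfile.gettempdir(), filename)
    urllib.request.urlretrieve(source_url, local_path)
    return local_path


def move_to_dbfs(dbutils, local_path, dbfs_dir="dbfs:/kdd"):
    """Move a driver-local file into DBFS (only works inside Databricks)."""
    target = f"{dbfs_dir}/{os.path.basename(local_path)}"
    # The "file:" prefix tells Databricks the source is on the driver's disk.
    dbutils.fs.mv(f"file:{local_path}", target)
    return target
```

In a notebook this would be called as `move_to_dbfs(dbutils, fetch_to_tmp(source_url))`, with `source_url` set to the link given in the assignment.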
3. [Marks: 5] After storing the data in the Databricks filesystem, load it from
disk into a Spark RDD. Print 10 values of your RDD and verify the type of your data
structure (RDD).
4. [Marks: 5] Split the data. (Each entry in your RDD is a comma-separated line of data,
which you first need to split before you can parse and build your data frame.) Show
the total number of features (columns) and print results. See this link for more details.
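One possible way to approach the split, assuming `raw_rdd` is the RDD of text lines loaded in the previous step. The function names are hypothetical; only `map` and `first` are standard RDD methods.

```python
def split_line(line):
    """Split one comma-separated record into a list of string fields."""
    return line.strip().split(",")


def split_rdd(raw_rdd):
    """Apply split_line to every record of an RDD of text lines."""
    return raw_rdd.map(split_line)


def feature_count(split_records):
    """Number of columns, taken from the first record of the split RDD."""
    return len(split_records.first())
```

For example, `feature_count(split_rdd(raw_rdd))` would print the column count; for KDD Cup 99 records this is expected to be 42 (41 features plus the label), which is worth confirming on the loaded data.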
5. [Marks: 5] Now extract these 7 columns (duration, protocol_type, service, src_bytes,
dst_bytes, flag and label) from your dataset. Build a new RDD and data frame. Print
the schema and display 10 values.
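A sketch of the column extraction, assuming each record has already been split into a list of strings. The zero-based positions in `FIELD_INDEX` follow the commonly published KDD Cup 99 column order and should be checked against your data; `select_fields` and `build_dataframe` are illustrative names, and `spark` is the SparkSession that Databricks supplies.

```python
# Zero-based positions of the requested fields in a KDD Cup 99 record
# (assumed from the dataset's published column order; verify on your data).
FIELD_INDEX = {
    "duration": 0,
    "protocol_type": 1,
    "service": 2,
    "flag": 3,
    "src_bytes": 4,
    "dst_bytes": 5,
    "label": 41,
}
COLUMNS = ["duration", "protocol_type", "service",
           "src_bytes", "dst_bytes", "flag", "label"]
NUMERIC = {"duration", "src_bytes", "dst_bytes"}


def select_fields(fields):
    """Pick the requested columns from one split record, casting numbers to int."""
    return tuple(
        int(fields[FIELD_INDEX[name]]) if name in NUMERIC else fields[FIELD_INDEX[name]]
        for name in COLUMNS
    )


def build_dataframe(spark, split_records):
    """Turn an RDD of split records into a DataFrame with named columns."""
    selected_rdd = split_records.map(select_fields)
    return spark.createDataFrame(selected_rdd, schema=COLUMNS)
```

Calling `df = build_dataframe(spark, split_records)` followed by `df.printSchema()` and `df.show(10)` would then cover the schema and preview parts of the question.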
6. [Marks: 5] Get the total number of connections grouped by protocol_type and grouped
by service. Show each result in ascending order. Plot a bar graph for each.
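The counting and plotting could be sketched as follows, assuming `df` is the data frame built in the previous question and that matplotlib is available on the cluster. All three function names are hypothetical; in Databricks the built-in `display(df)` charting is an alternative to matplotlib.

```python
def connection_counts(df, column):
    """Return (value, count) pairs for a DataFrame column, smallest count first."""
    rows = df.groupBy(column).count().orderBy("count").collect()
    return [(row[column], row["count"]) for row in rows]


def to_bar_series(pairs):
    """Split (value, count) pairs into parallel label and height lists,
    sorted by count in ascending order."""
    ordered = sorted(pairs, key=lambda pair: pair[1])
    return [p[0] for p in ordered], [p[1] for p in ordered]


def plot_counts(pairs, title):
    """Draw a bar chart with matplotlib (imported lazily; assumed installed)."""
    import matplotlib.pyplot as plt

    labels, heights = to_bar_series(pairs)
    fig, ax = plt.subplots()
    ax.bar(labels, heights)
    ax.set_title(title)
    ax.set_ylabel("connections")
    ax.tick_params(axis="x", rotation=90)
    return fig
```

For instance, `plot_counts(connection_counts(df, "protocol_type"), "Connections by protocol")` would produce the first chart, and the same call with `"service"` the second.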
7. [Marks: 15] Perform further exploratory data analysis that includes other columns of this
dataset. Plot at least 3 different charts/plots and explain them.
8. [Marks: 20] Look at the label column where label == ‘normal’. Now create a new label
column in which ‘normal’ stays ‘normal’ and every other value is labelled
‘attack’. Split your data (train/test) and, using the new label column, build a
simple machine learning model for intrusion detection (you may use a few selected
columns rather than all of them). Apply 2 different algorithms.
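A hedged sketch of the relabelling and the train/test split, assuming `df` has a string column named `label`. Raw labels in the data file are expected to carry a trailing dot (for example `normal.`), so the sketch strips it before comparing; the column name `binary_label`, the 80/20 fraction, and the seed are arbitrary illustrative choices.

```python
def to_binary_label(label):
    """Map a raw label to 'normal' or 'attack', ignoring a trailing dot."""
    return "normal" if label.strip().rstrip(".") == "normal" else "attack"


def add_binary_label(df, source="label", target="binary_label"):
    """Add the two-class column to a Spark DataFrame (pyspark imported lazily)."""
    from pyspark.sql import functions as F

    # Trim whitespace and drop a trailing dot before comparing to 'normal'.
    cleaned = F.regexp_replace(F.trim(F.col(source)), r"\.$", "")
    return df.withColumn(
        target, F.when(cleaned == "normal", "normal").otherwise("attack")
    )


def train_test_split(df, train_fraction=0.8, seed=42):
    """Randomly split a DataFrame into train and test parts."""
    return df.randomSplit([train_fraction, 1.0 - train_fraction], seed=seed)
```

The resulting `train` and `test` frames can then be fed to any two classifiers of your choice; the sketch deliberately stops before model selection, since that is the part the question asks you to justify.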
9. [Marks: 15] Explain which algorithms you selected and why. Show the results
for both with some success metrics and describe the best one.
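As an illustration of what "success metrics" might include, the following sketch computes accuracy, precision, recall, and F1 from confusion-matrix counts, treating ‘attack’ as the positive class. The function name and the choice of metrics are assumptions; other measures (for example area under the ROC curve) are equally valid.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives and
    true negatives; a ratio with a zero denominator is reported as 0.0.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1_denominator = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denominator if f1_denominator else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

Computing the same dictionary for both models on the same test split gives a like-for-like basis for describing which one performs best.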
1. [Marks: 2] Read the statements below, choose the correct answers, and provide
explanations. You can get more information by visiting this link.
Statements Yes No
1. A platform as a service (PaaS) solution that hosts
web apps in Azure provides professional
development services to continuously add
features to custom applications.
2. A platform as a service (PaaS) database offering in
Azure provides built-in high availability.
2. [Marks: 2] Read the statement below, choose the correct answer, and provide
an explanation.
A relational database must be used when:
a. A dynamic schema is required
b. Data will be stored as key/value pairs
c. Storing large images and videos
d. Strong consistency guarantees are required
3. [Marks: 2] Read the statement below, choose the correct answer, and provide
an explanation.
When you are implementing a Software as a Service solution, you are responsible for:
a. Configuring high availability
b. Defining scalability rules
c. Installing the SaaS solution
d. Configuring the SaaS solution
4. [Marks: 2] Read the statements below, choose the correct answers, and provide
explanations.
Statements Yes No
1. To achieve a hybrid cloud model, a company
must always migrate from a private cloud model
2. A company can extend the capacity of its internal
network by using a public cloud
3. In a public cloud model, only guest users at your
company can access the resources in the cloud
5. [Marks: 2] Read the statements below, choose the correct answers, and provide
explanations.
a. A cloud service that remains available after a failure occurs ______________
b. A cloud service that can be recovered after a failure occurs _______________
c. A cloud service that performs quickly when demand increases ____________
d. A cloud service that can be accessed quickly from the internet _____________
Options: Disaster Recovery, Fault Tolerance, Low Latency, Dynamic Scalability