Week 9 (7-11 October)
ISB Videos
This week we finished our ISB Machine Learning course with the last two lectures, which were on Text Analysis and Mining Graphs. The topics covered were as follows:
Word2vec
Word2vec is a two-layer neural net that processes text. Its input is a text corpus and its output is a set of vectors: feature vectors for the words in that corpus. While Word2vec is not a deep neural network, it turns text into a numerical form that deep nets can understand.
The purpose and usefulness of Word2vec are to group the vectors of similar words together in vector space. That is, it detects similarities mathematically. Word2vec creates vectors that are distributed numerical representations of word features, such as the context of individual words, and it does so without human intervention.
There are two types of Word2vec, Skip-gram and Continuous Bag of Words (CBOW). I will briefly describe how these two methods work in the following paragraphs.
Skip-gram
Words are read into the vector one at a time and scanned back and forth within a certain range. Those ranges are n-grams: an n-gram is a contiguous sequence of n items from a given linguistic sequence, i.e. a unigram, bigram, trigram, four-gram or five-gram. A skip-gram simply drops items from the n-gram. In Word2vec's Skip-gram architecture, the model uses the target word to predict the words in its surrounding context.
The graph below visualizes the network structure.
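To make the windowing idea concrete, here is a minimal sketch in Python (not from the lecture) that generates the (target, context) training pairs Skip-gram learns from; the sentence and the window size of 2 are just illustrative assumptions.

```python
# Minimal sketch (illustrative only): generating Skip-gram (target, context)
# training pairs from a tokenized sentence with a context window of 2.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        # Look at words up to `window` positions to the left and right.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

print(skipgram_pairs("he is a nice guy".split()))
# [('he', 'is'), ('he', 'a'), ('is', 'he'), ('is', 'a'), ('is', 'nice'), ...]
```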
Continuous Bag of Words (CBOW) is very similar to Skip-gram, except that it swaps the input and output. The idea is that given a context, we want to know which word is most likely to appear in it.
The biggest difference between Skip-gram and CBOW is the way the word vectors are generated. For CBOW, all the examples with the target word as the target are fed into the network, and the extracted hidden-layer values are averaged. For example, assume we only have two sentences, “He is a nice guy” and “She is a wise queen”. To compute the word representation for the word “a”, we feed both sentences into the neural network and take the average of the values in the hidden layer. Skip-gram, in contrast, feeds in only the one-hot vector of the single target word as input.
It is claimed that Skip-gram tends to do better on rare words. Nevertheless, the performance of Skip-gram and CBOW is generally similar.
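As a rough illustration of the two architectures, here is a minimal sketch using the gensim library (assumed installed, gensim 4.x API) that trains a Skip-gram model and a CBOW model on the two toy sentences above; the vector size, window, and epoch count are arbitrary choices.

```python
# Minimal sketch using gensim (assumed installed, gensim >= 4.x):
# training Skip-gram and CBOW models on the two toy sentences above.
from gensim.models import Word2Vec

sentences = [
    "he is a nice guy".split(),
    "she is a wise queen".split(),
]

# sg=1 selects Skip-gram; sg=0 (the default) selects CBOW.
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0, epochs=50)

# Each model now holds a 50-dimensional vector per word, e.g. for "a".
print(skipgram.wv["a"][:5])
print(cbow.wv.most_similar("is", topn=3))
```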
Mining Graphs
The Web as a Directed Graph
Viewing social and economic networks in terms of their graph structures provides significant insights, and the same is true for information networks such as the Web. Viewing the Web as a graph allows us to better understand the logical relationships expressed by its links; to break its structure into smaller, cohesive units; and to identify important pages as a step in organizing the results of Web searches.
Early Search Engines and Term Spam
Techniques for fooling search engines into believing your page is about something it is not are called term spam. The ability of term spammers to operate so easily rendered early search engines almost useless. To combat term spam, Google introduced two innovations:
1. PageRank was used to simulate where Web surfers, starting at a random page, would tend to congregate if they followed randomly chosen outlinks from the page at which they were currently located, with this process allowed to iterate many times. Pages that would have a large number of surfers were considered more “important” than pages that would rarely be visited. Google prefers important pages to unimportant pages when deciding which pages to show first in response to a search query (a small sketch of this idea follows the list below).
2. The content of a page was judged not only by the terms appearing on that page but also by the terms used in or near the links to that page. Note that while it is easy for a spammer to add false terms to a page they control, they cannot as easily get false terms added to the pages that link to their own page if they do not control those pages.
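As a rough illustration of the random-surfer idea behind PageRank (point 1 above), here is a minimal power-iteration sketch on a tiny made-up link graph; the graph, damping factor, and iteration count are assumptions for illustration only.

```python
# Minimal sketch (illustrative only): PageRank via power iteration on a tiny
# link graph, following the random-surfer idea with a damping factor of 0.85.
import numpy as np

# Adjacency: links[i] lists the pages that page i links to (a made-up graph).
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
n = len(links)
damping = 0.85

# Column-stochastic transition matrix: the surfer follows a random outlink.
M = np.zeros((n, n))
for src, outs in links.items():
    for dst in outs:
        M[dst, src] = 1.0 / len(outs)

rank = np.full(n, 1.0 / n)
for _ in range(100):  # iterate until the distribution roughly stabilizes
    rank = (1 - damping) / n + damping * M @ rank

print(rank)  # higher values = pages where random surfers tend to congregate
```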
Spectral Clustering
In multivariate statistics and the clustering of data, spectral clustering techniques
make use of the spectrum (eigenvalues) of the similarity matrix of the
data to perform dimensionality reduction before clustering in fewer
dimensions. The similarity matrix is provided as an input and consists
of a quantitative assessment of the relative similarity of each pair of
points in the dataset.
In application to image segmentation, spectral clustering is known as segmentation-based object categorization.
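A minimal sketch of this, assuming scikit-learn is available: we build a similarity matrix for a toy dataset and feed it to SpectralClustering as a precomputed affinity. The data and the kernel parameter are made up for illustration.

```python
# Minimal sketch (assuming scikit-learn): spectral clustering from a
# precomputed similarity matrix, as described above.
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import rbf_kernel

# Toy data: two loose groups of 2-D points (made up for illustration).
X = np.array([[0, 0], [0, 1], [1, 0], [8, 8], [8, 9], [9, 8]], dtype=float)

# Pairwise similarity matrix (RBF kernel) supplied as the algorithm's input.
similarity = rbf_kernel(X, gamma=0.5)

labels = SpectralClustering(
    n_clusters=2, affinity="precomputed", random_state=0
).fit_predict(similarity)
print(labels)  # e.g. [0 0 0 1 1 1]
```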
Satnam Sir's session on Cyber Security
Sir discussed the following topics in the session:
Data Exfiltration Detection
Exfiltration is a rather new word in the English language; in fact, it wasn’t widely used until recently.
By definition, data exfiltration is the unauthorized copying, transfer, or retrieval of data from a computer or server. It is a malicious activity performed through various techniques, typically by cybercriminals over the internet or other networks.
More specifically, data exfiltration is a security breach that occurs when one’s data is illegally copied. Normally, it’s the result of a targeted attack where the malicious actor’s primary intent is to find and copy specific data from a specific machine. The hacker gains access to the target machine through a remote application or by directly installing a portable media device. These breaches often occur on systems that still use the hardware/software vendor’s default password or an easy-to-guess password.
Splunk for Detecting and Stopping Data Exfiltration
He then showed us how he used Splunk to detect unusual activity in web logs.
The User Activity dashboard displays panels representing user activities such as potential data exfiltration. A spike in, or a high volume of, key indicators such as Non-corporate Web Uploads and Non-corporate Email Activity can indicate suspicious data transfer. The dashboard indicates a high volume of suspicious activity involving data being uploaded to non-corporate domains, as well as suspiciously large email messages sent to addresses outside the organization.
The User Activity dashboard was used as the starting point to detect suspicious data exfiltration behavior. The Email Activity dashboard exposed large data transfers to known and unknown domains. Using the dashboards and searches provided with Splunk Enterprise Security, a security analyst can check for common data exfiltration behaviors, set up monitoring of potentially compromised machines, and take the necessary remedial action.
Anomaly Detection Algorithms
Anomaly detection is a technique used to identify unusual patterns that do not conform to expected behavior, called outliers.
Density-Based Anomaly Detection
Density-based anomaly detection is based on the k-nearest neighbors algorithm.
Assumption: Normal data points occur around a dense neighborhood and abnormalities are far away.
The nearest set of data points is evaluated using a score, which could be Euclidean distance or a similar measure depending on the type of data (categorical or numerical). Density-based approaches can be broadly classified into two algorithms:
- K-nearest neighbor: k-NN is a simple, non-parametric lazy learning technique used to classify data based on similarities in distance metrics such as Euclidean, Manhattan, Minkowski, or Hamming distance.
- Relative density of data: This is better known as the local outlier factor (LOF). The concept is based on a distance metric called reachability distance (a short LOF sketch follows this list).
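Here is a minimal LOF sketch, assuming scikit-learn; the toy points and the number of neighbors are arbitrary choices.

```python
# Minimal sketch (assuming scikit-learn): density-based anomaly detection
# with the Local Outlier Factor (LOF) on made-up 2-D data.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Mostly dense points near the origin, plus one obvious outlier.
X = np.array([[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [0.1, 0.2], [5.0, 5.0]])

lof = LocalOutlierFactor(n_neighbors=2)
labels = lof.fit_predict(X)   # 1 = inlier, -1 = outlier
print(labels)                 # e.g. [ 1  1  1  1 -1]
print(lof.negative_outlier_factor_)  # lower (more negative) = more anomalous
```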
Clustering-Based Anomaly Detection
Clustering is one of the most popular concepts in the domain of unsupervised learning.
Assumption: Data points that are similar tend to belong to similar groups or clusters, as determined by their distance from local centroids.
K-means is a widely used clustering algorithm. It creates 'k' similar clusters of data points. Data instances that fall outside of these groups could potentially be marked as anomalies.
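A minimal sketch of this idea, assuming scikit-learn: cluster the data with K-means and flag the points that lie unusually far from their assigned centroid. The data, k, and the threshold are arbitrary choices.

```python
# Minimal sketch (assuming scikit-learn): clustering-based anomaly detection.
# Points whose distance to their nearest centroid is unusually large are
# flagged as potential anomalies; k and the threshold are arbitrary choices.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.1], [0.9, 1.0], [1.1, 0.9],
              [5.0, 5.1], [5.1, 4.9], [9.0, 0.0]])  # last point is odd

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# Distance of every point to its assigned cluster centroid.
dist = np.linalg.norm(X - kmeans.cluster_centers_[kmeans.labels_], axis=1)

threshold = np.percentile(dist, 90)   # flag the most distant ~10% of points
print(np.where(dist > threshold)[0])  # e.g. [5] -> the odd point stands out
```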
Support Vector Machine-Based Anomaly Detection
A support vector machine is another effective technique for detecting anomalies. An SVM is typically associated with supervised learning, but there are extensions (OneClassSVM, for instance) that can be used to identify anomalies as an unsupervised problem (in which training data are not labeled). The algorithm learns a soft boundary in order to cluster the normal data instances using the training set and then, for a test instance, identifies the abnormalities that fall outside the learned region.
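A minimal sketch, assuming scikit-learn's OneClassSVM: train on normal-looking data, then let the model flag test points outside the learned boundary. The synthetic data, nu, and gamma are arbitrary choices.

```python
# Minimal sketch (assuming scikit-learn): One-Class SVM anomaly detection.
# The model learns a soft boundary around the "normal" training data and
# flags test points that fall outside it.
import numpy as np
from sklearn.svm import OneClassSVM

# Training data: normal points clustered around (0, 0).
rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=0.5, size=(200, 2))

model = OneClassSVM(nu=0.05, kernel="rbf", gamma="scale").fit(X_train)

# Test data: two normal-looking points and one far-away point.
X_test = np.array([[0.1, -0.2], [0.3, 0.1], [4.0, 4.0]])
print(model.predict(X_test))  # 1 = normal, -1 = anomaly, e.g. [ 1  1 -1]
```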
Depending on the use case, the output of an anomaly detector could be numeric scalar values for filtering on domain-specific thresholds, or textual labels (such as binary/multi labels).
He then asked us to work on a Kaggle dataset for the Credit Card Fraud Detection problem, and to solve it as an anomaly detection problem rather than a classification problem.