Summary Big data management & Analytics. Grade: 8.8


Summary of the course BDMA. Grade achieved: 8.8


  • December 7, 2020
  • 84 pages
  • 2019/2020
  • Summary

Summary Big Data Management and Analytics
Book Data Science
Chapter 1
Data science involves principles, processes, and techniques for understanding phenomena via the
(automated) analysis of data. The ultimate goal is improving decision making.

Data-driven decision-making (DDD) refers to the practice of basing decisions on the analysis of data,
rather than purely on intuition. There are two sorts of decisions:

(1) Decisions for which “discoveries” need to be made within data
(2) Decisions that repeat, especially at massive scale, and so decision-making can benefit from
even small increases in decision-making accuracy based on data analysis.
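
A minimal arithmetic sketch of the second kind of decision. All numbers here (decision volume, value per correct decision, accuracy levels) are hypothetical and chosen only to show how a one-point accuracy gain adds up at scale.

```python
# Hypothetical numbers, chosen only for illustration.
decisions_per_year = 10_000_000   # how often the same decision repeats
value_per_correct = 2             # gain (in dollars) from each correct decision


def correct_decisions(accuracy_percent: int) -> int:
    """Number of correct decisions per year at a given accuracy (whole percent)."""
    return decisions_per_year * accuracy_percent // 100


baseline = correct_decisions(80)   # accuracy before using data analysis
improved = correct_decisions(81)   # accuracy after a one-point improvement
extra_value = (improved - baseline) * value_per_correct
print(extra_value)
```

Under these assumptions, a single percentage point of accuracy is worth 100,000 additional correct decisions per year.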




There is a lot to data processing that is not data science—despite the impression one might get from
the media. Data engineering and processing are critical to support data science, but they are more
general.

• Data science needs access to data and it often benefits from sophisticated data engineering
that data processing technologies may facilitate, but these technologies are not data science
technologies per se.
• Data processing technologies are very important for many data-oriented business tasks that
do not involve extracting knowledge or data-driven decision-making, such as efficient
transaction processing, modern web system processing, and online advertising campaign
management.

Big data essentially means datasets that are too large for traditional data processing systems, and
therefore require new processing technologies. Big data technologies are used for:

• Data engineering
• Data mining
• But, most often: data processing in support of data mining techniques and other data science
activities

A fundamental strategy of data science is to acquire the necessary data at a cost. Once we view data
as a business asset, we should think about whether and how much we are willing to invest.

Four fundamental concepts of data science:

1. Extracting useful knowledge from data to solve business problems can be treated
systematically by following a process with reasonably well-defined stages.
2. From a large mass of data, information technology can be used to find informative
descriptive attributes of entities of interest.
3. If you look too hard at a set of data, you will find something—but it might not generalize
beyond the data you’re looking at.
4. Formulating data mining solutions and evaluating the results involves thinking carefully
about the context in which they will be used.

Chapter 2
Fundamental concepts: A set of canonical data mining tasks; The data mining process; Supervised
versus unsupervised data mining.

An important principle of data science is that data mining is a process with fairly well-understood
stages.

Examples of data mining algorithm tasks:

1. Classification and class probability estimation attempt to predict, for each individual in a
population, which of a (small) set of classes this individual belongs to. (E.g. “Among all the
customers of MegaTelCo, which are likely to respond to a given offer?”) In this example the
two classes could be called will respond and will not respond.
a. A closely related task is scoring or class probability estimation. A scoring model
applied to an individual produces, instead of a class prediction, a score representing
the probability that that individual belongs to each class.
2. Regression: (“value estimation”) attempts to estimate or predict, for each individual, the
numerical value of some variable for that individual. An example regression question would
be: “How much will a given customer use the service?”
a. Regression is related to classification, but the two are different. Informally,
classification predicts whether something will happen, whereas regression predicts
how much something will happen.
3. Similarity matching: attempts to identify similar individuals based on data known about
them. Similarity matching can be used directly to find similar entities. For example, IBM is
interested in finding companies similar to their best business customers, in order to focus
their sales force on the best opportunities.
4. Clustering: attempts to group individuals in a population together by their similarity, but not
driven by any specific purpose. An example clustering question would be: “Do our customers
form natural groups or segments?”
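
The difference between scoring, classification, and similarity matching can be sketched in a few lines. This is only an illustration: the features, the weights in `response_score`, the 0.5 threshold, and the company data are all made up, and a real model would learn its weights from data.

```python
import math


def response_score(minutes: float, calls: float) -> float:
    """Scoring: estimated probability that a customer responds to an offer.

    The weights are hypothetical, not learned from data.
    """
    z = 0.01 * minutes - 0.5 * calls - 1.0
    return 1.0 / (1.0 + math.exp(-z))


def classify(minutes: float, calls: float, threshold: float = 0.5) -> str:
    """Classification: turn the score into one of two classes."""
    if response_score(minutes, calls) >= threshold:
        return "will respond"
    return "will not respond"


def most_similar(target, candidates):
    """Similarity matching: name of the candidate closest to the target."""
    return min(candidates, key=lambda name: math.dist(target, candidates[name]))


# Hypothetical feature vectors: (monthly minutes, support calls)
best_customer = (300, 0)
prospects = {"A": (320, 1), "B": (90, 4), "C": (310, 0)}

print(classify(300, 0), round(response_score(300, 0), 2))
print(most_similar(best_customer, prospects))
```

Euclidean distance on raw features lets the attribute with the largest scale dominate, so in practice the features would usually be rescaled first.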


Supervised versus unsupervised methods: A vital part in the early stages of the data mining process
is (i) to decide whether the line of attack will be supervised or unsupervised, and (ii) if supervised, to
produce a precise definition of a target variable.

• Consider two similar questions we might ask about a customer population. The first is: “Do
our customers naturally fall into different groups?” Here no specific purpose or target has
been specified for the grouping. When there is no such target, the data mining problem is
referred to as unsupervised.
    o Example: Clustering
• Contrast this with a slightly different question: “Can we find groups of customers who have
particularly high likelihoods of canceling their service soon after their contracts expire?” Here
there is a specific target defined: will a customer leave when her contract expires? In this
case, segmentation is being done for a specific reason. This is called a supervised data mining
problem.
    o Examples: Classification & Regression.
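
The two questions can be contrasted on one small dataset. This is a sketch with invented customers. The "largest gap" split stands in for a real clustering method, and the per-contract churn rate is only one simple way to use the target.

```python
# Hypothetical customers: (contract_type, monthly_bill, churned)
customers = [
    ("monthly", 20, True),
    ("monthly", 25, True),
    ("monthly", 22, False),
    ("yearly", 60, False),
    ("yearly", 65, False),
    ("yearly", 62, True),
]

# Unsupervised: no target is used. Split the bills into two groups at the
# largest gap between sorted values (a deliberately simple 1-D clustering).
bills = sorted(bill for _, bill, _ in customers)
gaps = [(bills[i + 1] - bills[i], i) for i in range(len(bills) - 1)]
_, split = max(gaps)
low_group, high_group = bills[: split + 1], bills[split + 1:]


# Supervised: the target "churned" is defined, and groups are judged by it.
def churn_rate(contract_type: str) -> float:
    flags = [churned for ctype, _, churned in customers if ctype == contract_type]
    return sum(flags) / len(flags)


print(low_group, high_group)
print(churn_rate("monthly"), churn_rate("yearly"))
```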

Cross Industry Standard Process for Data Mining (CRISP-DM)

The CRISP-DM process diagram makes explicit the fact that iteration is the rule rather than the
exception. Going through the process once without having solved the problem is, generally speaking,
not a failure.

Business Understanding

Initially, it is vital to understand the problem to be solved. This may seem obvious, but business
projects seldom come pre-packaged as clear and unambiguous data mining problems. Often
recasting the problem and designing a solution is an iterative process of discovery. The process
model represents this as cycles within a cycle, rather than as a simple linear process. The initial
formulation may not be complete or optimal so multiple iterations may be necessary for an
acceptable solution formulation to appear. In this first stage, the design team should think carefully
about the use scenario – What exactly do we want to do?

Data Understanding

If solving the business problem is the goal, the data comprise the available raw material from which
the solution will be built. It is important to understand the strengths and limitations of the data
because rarely is there an exact match with the problem. A critical part of the data understanding
phase is estimating the costs and benefits of each data source and deciding whether further
investment is merited. In data understanding we need to dig beneath the surface to uncover the
structure of the business problem and the data that are available, and then match them to one or
more data mining tasks.

Data Preparation

A data preparation phase often proceeds along with data understanding, in which the data are
manipulated and converted into forms that yield better results. Typical examples of data preparation
are converting data to tabular format, removing or inferring missing values, and converting data to
different types.
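
The three typical preparation steps can be sketched on a toy example. The records and field names are hypothetical, and filling a missing value with the mean of the known values is just one simple way to infer it.

```python
# Hypothetical raw records, as they might arrive from a source system.
raw_records = [
    {"customer": "a", "age": "34", "usage": "120.5"},
    {"customer": "b", "age": "", "usage": "80"},
    {"customer": "c", "age": "46", "usage": "95.5"},
]


def to_number(text: str):
    """Convert types: numeric strings become floats, empty strings become None."""
    return float(text) if text else None


ages = [to_number(r["age"]) for r in raw_records]
known_ages = [a for a in ages if a is not None]
mean_age = sum(known_ages) / len(known_ages)

# Tabular format: one row per customer, one column per attribute.
# The missing age is inferred as the mean of the known ages.
table = [
    (r["customer"], age if age is not None else mean_age, to_number(r["usage"]))
    for r, age in zip(raw_records, ages)
]

for row in table:
    print(row)
```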

Modeling

The output of modeling is some sort of model or pattern capturing regularities in the data. The
modeling stage is the primary place where data mining techniques are applied to the data.

Evaluation

The purpose of the evaluation stage is to assess the data mining results rigorously and to gain
confidence that they are valid and reliable before moving on. Equally important, the evaluation stage
also serves to help ensure that the model satisfies the original business goals. Recall that the primary
goal of data science for business is to support decision making.

A model may be extremely accurate (> 99%) by laboratory standards, but evaluation in the actual
business context may reveal that it still produces too many false alarms to be economically feasible.
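
A worked example shows how this can happen. The numbers are hypothetical: a fraud-detection setting in which 0.1% of transactions are fraudulent and the model gets 99% of each class right.

```python
# Hypothetical numbers: 1,000,000 transactions, 0.1% of them fraudulent.
total = 1_000_000
positives = 1_000
negatives = total - positives

# Assume the model classifies 99% of each class correctly.
true_positives = positives * 99 // 100    # frauds that raise an alarm
false_positives = negatives * 1 // 100    # legitimate transactions that raise an alarm
true_negatives = negatives - false_positives

accuracy = (true_positives + true_negatives) / total
precision = true_positives / (true_positives + false_positives)

print(f"accuracy:  {accuracy:.2%}")
print(f"alarms:    {true_positives + false_positives}")
print(f"precision: {precision:.2%}")
```

Under these assumptions the model is 99% accurate, yet only about 9% of its alarms point to real fraud, because the rare class is swamped by false alarms from the common one.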

Deployment

In deployment the results of data mining—and increasingly the data mining techniques themselves—
are put into real use in order to realize some return on investment. The clearest cases of deployment
involve implementing a predictive model in some information system or business process.

The main difference between data mining and other analytics techniques is that data mining focuses
on the automated search for knowledge, patterns, or regularities from data.

Chapter 3
Fundamental concepts: Identifying informative attributes; Segmenting data by progressive attribute
selection.

Supervised segmentation: how can we segment the population into groups that differ from each
other with respect to some quantity of interest?

• One of the fundamental ideas of data mining: finding or selecting important, informative
variables or “attributes” of the entities described by the data.
    o Information is a quantity that reduces uncertainty about something.
• Finding informative attributes also is the basis for a widely used predictive modeling
technique called tree induction. Tree induction incorporates the idea of supervised
segmentation in an elegant manner, repeatedly selecting informative attributes.
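
One common way to make "reduces uncertainty" measurable is entropy-based information gain. The text above does not give a formula, so treat the choice of entropy here as an assumption. The rows and attribute names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels) -> float:
    """Uncertainty (in bits) about the class of a randomly drawn item."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute: str, target: str) -> float:
    """Reduction in target entropy after segmenting the rows by one attribute."""
    parent = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    children = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent - children


# Hypothetical customers: contract type separates churners perfectly, region does not.
rows = [
    {"contract": "monthly", "region": "north", "churn": "yes"},
    {"contract": "monthly", "region": "south", "churn": "yes"},
    {"contract": "yearly", "region": "north", "churn": "no"},
    {"contract": "yearly", "region": "south", "churn": "no"},
]

gain_contract = information_gain(rows, "contract", "churn")
gain_region = information_gain(rows, "region", "churn")
print(gain_contract, gain_region)
```

A tree induction procedure would pick the attribute with the highest gain (here `contract`), split on it, and repeat the selection within each resulting segment.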

Supervised data mining can be divided into classification and regression.

Supervised learning is model creation where the model describes a relationship between a set of
selected variables (attributes or features) and a predefined variable called the target variable. The
model estimates the value of the target variable as a function (possibly a probabilistic function) of
the features.
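
As a small sketch of "the target as a function of the features", the code below fits a straight line to one feature by ordinary least squares. The data and variable names are hypothetical, and a linear model is only one of many possible model forms.

```python
# Hypothetical training data: feature = months as a customer, target = monthly usage.
months = [1, 2, 3, 4]
usage = [12, 14, 16, 18]


def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(months, usage)


def predict(x: float) -> float:
    """The fitted model: estimates the target from the feature."""
    return intercept + slope * x


print(slope, intercept, predict(5))
```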

