Data science for business : what you need to know about data mining and data-analytic thinking / Foster Provost & Tom Fawcett.
Available at World Wide Web
Title
Data science for business : what you need to know about data mining and data-analytic thinking / Foster Provost & Tom Fawcett.
Author
Provost, Foster, 1964-
Added Author
Fawcett, Tom.
Edition
1st ed.
Description
1 online resource (xviii, 384 pages) : illustrations
ISBN
9781449374280 (electronic book)
144937428X (electronic book)
9781449374297 (electronic book)
1449374298 (electronic book)
9781449361327 (paperback)
1449361323 (paperback)
Summary
Annotation: This broad, deep, but not-too-technical guide introduces you to the fundamental principles of data science and walks you through the "data-analytic thinking" necessary for extracting useful knowledge and business value from the data you collect. By learning data science principles, you will understand the many data-mining techniques in use today. More importantly, these principles underpin the processes and strategies necessary to solve business problems through data-mining techniques.
Bibliography, etc. Note
Includes bibliographical references (pages 359-366) and index.
Formatted Contents Note
Machine generated contents note:
1. Introduction: Data-Analytic Thinking
The Ubiquity of Data Opportunities
Example: Hurricane Frances
Example: Predicting Customer Churn
Data Science, Engineering, and Data-Driven Decision Making
Data Processing and "Big Data"
From Big Data 1.0 to Big Data 2.0
Data and Data Science Capability as a Strategic Asset
Data-Analytic Thinking
This Book
Data Mining and Data Science, Revisited
Chemistry Is Not About Test Tubes: Data Science Versus the Work of the Data Scientist
Summary
2. Business Problems and Data Science Solutions
Fundamental concepts: A set of canonical data mining tasks; The data mining process; Supervised versus unsupervised data mining
From Business Problems to Data Mining Tasks
Supervised Versus Unsupervised Methods
Data Mining and Its Results
The Data Mining Process
Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment
Implications for Managing the Data Science Team
Other Analytics Techniques and Technologies
Statistics
Database Querying
Data Warehousing
Regression Analysis
Machine Learning and Data Mining
Answering Business Questions with These Techniques
Summary
3. Introduction to Predictive Modeling: From Correlation to Supervised Segmentation
Fundamental concepts: Identifying informative attributes; Segmenting data by progressive attribute selection
Exemplary techniques: Finding correlations; Attribute/variable selection; Tree induction
Models, Induction, and Prediction
Supervised Segmentation
Selecting Informative Attributes
Example: Attribute Selection with Information Gain
Supervised Segmentation with Tree-Structured Models
Visualizing Segmentations
Trees as Sets of Rules
Probability Estimation
Example: Addressing the Churn Problem with Tree Induction
Summary
4. Fitting a Model to Data
Fundamental concepts: Finding "optimal" model parameters based on data; Choosing the goal for data mining; Objective functions; Loss functions
Exemplary techniques: Linear regression; Logistic regression; Support-vector machines
Classification via Mathematical Functions
Linear Discriminant Functions
Optimizing an Objective Function
An Example of Mining a Linear Discriminant from Data
Linear Discriminant Functions for Scoring and Ranking Instances
Support Vector Machines, Briefly
Regression via Mathematical Functions
Class Probability Estimation and Logistic "Regression"
Logistic Regression: Some Technical Details
Example: Logistic Regression versus Tree Induction
Nonlinear Functions, Support Vector Machines, and Neural Networks
Summary
5. Overfitting and Its Avoidance
Fundamental concepts: Generalization; Fitting and overfitting; Complexity control
Exemplary techniques: Cross-validation; Attribute selection; Tree pruning; Regularization
Generalization
Overfitting
Overfitting Examined
Holdout Data and Fitting Graphs
Overfitting in Tree Induction
Overfitting in Mathematical Functions
Example: Overfitting Linear Functions
Example: Why Is Overfitting Bad?
From Holdout Evaluation to Cross-Validation
The Churn Dataset Revisited
Learning Curves
Overfitting Avoidance and Complexity Control
Avoiding Overfitting with Tree Induction
A General Method for Avoiding Overfitting
Avoiding Overfitting for Parameter Optimization
Summary
6. Similarity, Neighbors, and Clusters
Fundamental concepts: Calculating similarity of objects described by data; Using similarity for prediction; Clustering as similarity-based segmentation
Exemplary techniques: Searching for similar entities; Nearest neighbor methods; Clustering methods; Distance metrics for calculating similarity
Similarity and Distance
Nearest-Neighbor Reasoning
Example: Whiskey Analytics
Nearest Neighbors for Predictive Modeling
How Many Neighbors and How Much Influence?
Geometric Interpretation, Overfitting, and Complexity Control
Issues with Nearest-Neighbor Methods
Some Important Technical Details Relating to Similarities and Neighbors
Heterogeneous Attributes
Other Distance Functions
Combining Functions: Calculating Scores from Neighbors
Clustering
Example: Whiskey Analytics Revisited
Hierarchical Clustering
Nearest Neighbors Revisited: Clustering Around Centroids
Example: Clustering Business News Stories
Understanding the Results of Clustering
Using Supervised Learning to Generate Cluster Descriptions
Stepping Back: Solving a Business Problem Versus Data Exploration
Summary
7. Decision Analytic Thinking I: What Is a Good Model?
Fundamental concepts: Careful consideration of what is desired from data science results; Expected value as a key evaluation framework; Consideration of appropriate comparative baselines
Exemplary techniques: Various evaluation metrics; Estimating costs and benefits; Calculating expected profit; Creating baseline methods for comparison
Evaluating Classifiers
Plain Accuracy and Its Problems
The Confusion Matrix
Problems with Unbalanced Classes
Problems with Unequal Costs and Benefits
Generalizing Beyond Classification
A Key Analytical Framework: Expected Value
Using Expected Value to Frame Classifier Use
Using Expected Value to Frame Classifier Evaluation
Evaluation, Baseline Performance, and Implications for Investments in Data
Summary
8. Visualizing Model Performance
Fundamental concepts: Visualization of model performance under various kinds of uncertainty; Further consideration of what is desired from data mining results
Exemplary techniques: Profit curves; Cumulative response curves; Lift curves; ROC curves
Ranking Instead of Classifying
Profit Curves
ROC Graphs and Curves
The Area Under the ROC Curve (AUC)
Cumulative Response and Lift Curves
Example: Performance Analytics for Churn Modeling
Summary
9. Evidence and Probabilities
Fundamental concepts: Explicit evidence combination with Bayes' Rule; Probabilistic reasoning via assumptions of conditional independence
Exemplary techniques: Naive Bayes classification; Evidence lift
Example: Targeting Online Consumers With Advertisements
Combining Evidence Probabilistically
Joint Probability and Independence
Bayes' Rule
Applying Bayes' Rule to Data Science
Conditional Independence and Naive Bayes
Advantages and Disadvantages of Naive Bayes
A Model of Evidence "Lift"
Example: Evidence Lifts from Facebook "Likes"
Evidence in Action: Targeting Consumers with Ads
Summary
10. Representing and Mining Text
Fundamental concepts: The importance of constructing mining-friendly data representations; Representation of text for data mining
Exemplary techniques: Bag of words representation; TFIDF calculation; N-grams; Stemming; Named entity extraction; Topic models
Why Text Is Important
Why Text Is Difficult
Representation
Bag of Words
Term Frequency
Measuring Sparseness: Inverse Document Frequency
Combining Them: TFIDF
Example: Jazz Musicians
The Relationship of IDF to Entropy
Beyond Bag of Words
N-gram Sequences
Named Entity Extraction
Topic Models
Example: Mining News Stories to Predict Stock Price Movement
The Task
The Data
Data Preprocessing
Results
Summary
11. Decision Analytic Thinking II: Toward Analytical Engineering
Fundamental concept: Solving business problems with data science starts with analytical engineering: designing an analytical solution, based on the data, tools, and techniques available
Exemplary technique: Expected value as a framework for data science solution design
Targeting the Best Prospects for a Charity Mailing
The Expected Value Framework: Decomposing the Business Problem and Recomposing the Solution Pieces
A Brief Digression on Selection Bias
Our Churn Example Revisited with Even More Sophistication
The Expected Value Framework: Structuring a More Complicated Business Problem
Assessing the Influence of the Incentive
From an Expected Value Decomposition to a Data Science Solution
Summary
12. Other Data Science Tasks and Techniques
Fundamental concepts: Our fundamental concepts as the basis of many common data science techniques; The importance of familiarity with the building blocks of data science
Exemplary techniques: Association and co-occurrences; Behavior profiling; Link prediction; Data reduction; Latent information mining; Movie recommendation; Bias-variance decomposition of error; Ensembles of models; Causal reasoning from data
Co-occurrences and Associations: Finding Items That Go Together
Measuring Surprise: Lift and Leverage
Example: Beer and Lottery Tickets
Associations Among Facebook Likes
Profiling: Finding Typical Behavior
Link Prediction and Social Recommendation
Data Reduction, Latent Information, and Movie Recommendation
Bias, Variance, and Ensemble Methods
Data-Driven Causal Explanation and a Viral Marketing Example
Summary
13. Data Science and Business Strategy
Fundamental concepts: Our principles as the basis of success for a data-driven business; Acquiring and sustaining competitive advantage via data science; The importance of careful curation of data science capability
Thinking Data-Analytically, Redux
Achieving Competitive Advantage with Data Science
Sustaining Competitive Advantage with Data Science
Formidable Historical Advantage
Unique Intellectual Property
Unique Intangible Collateral Assets
Superior Data Scientists
Superior Data Science Management
Attracting and Nurturing Data Scientists and Their Teams
Examine Data Science Case Studies
Be Ready to Accept Creative Ideas from Any Source
Be Ready to Evaluate Proposals for Data Science Projects
Example Data Mining Proposal
Flaws in the Big Red Proposal
A Firm's Data Science Maturity
14. Conclusion
The Fundamental Concepts of Data Science
Applying Our Fundamental Concepts to a New Problem: Mining Mobile Device Data
Changing the Way We Think about Solutions to Business Problems
What Data Can't Do: Humans in the Loop, Revisited
Privacy, Ethics, and Mining Data About Individuals
Is There More to Data Science?
Final Example: From Crowd-Sourcing to Cloud-Sourcing
Final Words.
Digital File Characteristics
text file
Source of Description
Print version record.
Location
www
Available in Other Form
Print version: Provost, Foster, 1964- Data science for business. Sebastopol, Calif. : O'Reilly, 2013
Linked Resources
Access provided by Berkeley Law Library
Published
Sebastopol, CA : O'Reilly Media, 2013.
Language
English
Record Appears in
Monographs & Serials
Electronic Resources