Annotation
This broad, deep, but not-too-technical guide introduces you to the fundamental principles of data science and walks you through the "data-analytic thinking" necessary for extracting useful knowledge and business value from the data you collect. By learning data science principles, you will understand the many data-mining techniques in use today. More importantly, these principles underpin the processes and strategies necessary to solve business problems through data mining techniques.
Bibliography, etc. Note
Includes bibliographical references (pages 359-366) and index.
Formatted Contents Note
Machine generated contents note:
1. Introduction: Data-Analytic Thinking The Ubiquity of Data Opportunities Example: Hurricane Frances Example: Predicting Customer Churn Data Science, Engineering, and Data-Driven Decision Making Data Processing and "Big Data" From Big Data 1.0 to Big Data 2.0 Data and Data Science Capability as a Strategic Asset Data-Analytic Thinking This Book Data Mining and Data Science, Revisited Chemistry Is Not About Test Tubes: Data Science Versus the Work of the Data Scientist Summary
2. Business Problems and Data Science Solutions Fundamental concepts: A set of canonical data mining tasks; The data mining process; Supervised versus unsupervised data mining From Business Problems to Data Mining Tasks Supervised Versus Unsupervised Methods Data Mining and Its Results The Data Mining Process Business Understanding Data Understanding Data Preparation Modeling Evaluation Deployment Implications for Managing the Data Science Team Other Analytics Techniques and Technologies Statistics Database Querying Data Warehousing Regression Analysis Machine Learning and Data Mining Answering Business Questions with These Techniques Summary
3. Introduction to Predictive Modeling: From Correlation to Supervised Segmentation Fundamental concepts: Identifying informative attributes; Segmenting data by progressive attribute selection Exemplary techniques: Finding correlations; Attribute/variable selection; Tree induction Models, Induction, and Prediction Supervised Segmentation Selecting Informative Attributes Example: Attribute Selection with Information Gain Supervised Segmentation with Tree-Structured Models Visualizing Segmentations Trees as Sets of Rules Probability Estimation Example: Addressing the Churn Problem with Tree Induction Summary
4. Fitting a Model to Data Fundamental concepts: Finding "optimal" model parameters based on data; Choosing the goal for data mining; Objective functions; Loss functions Exemplary techniques: Linear regression; Logistic regression; Support-vector machines Classification via Mathematical Functions Linear Discriminant Functions Optimizing an Objective Function An Example of Mining a Linear Discriminant from Data Linear Discriminant Functions for Scoring and Ranking Instances Support Vector Machines, Briefly Regression via Mathematical Functions Class Probability Estimation and Logistic "Regression" Logistic Regression: Some Technical Details Example: Logistic Regression versus Tree Induction Nonlinear Functions, Support Vector Machines, and Neural Networks Summary
5. Overfitting and Its Avoidance Fundamental concepts: Generalization; Fitting and overfitting; Complexity control Exemplary techniques: Cross-validation; Attribute selection; Tree pruning; Regularization Generalization Overfitting Overfitting Examined Holdout Data and Fitting Graphs Overfitting in Tree Induction Overfitting in Mathematical Functions Example: Overfitting Linear Functions Example: Why Is Overfitting Bad? From Holdout Evaluation to Cross-Validation The Churn Dataset Revisited Learning Curves Overfitting Avoidance and Complexity Control Avoiding Overfitting with Tree Induction A General Method for Avoiding Overfitting Avoiding Overfitting for Parameter Optimization Summary
6. Similarity, Neighbors, and Clusters Fundamental concepts: Calculating similarity of objects described by data; Using similarity for prediction; Clustering as similarity-based segmentation Exemplary techniques: Searching for similar entities; Nearest neighbor methods; Clustering methods; Distance metrics for calculating similarity Similarity and Distance Nearest-Neighbor Reasoning Example: Whiskey Analytics Nearest Neighbors for Predictive Modeling How Many Neighbors and How Much Influence? Geometric Interpretation, Overfitting, and Complexity Control Issues with Nearest-Neighbor Methods Some Important Technical Details Relating to Similarities and Neighbors Heterogeneous Attributes Other Distance Functions Combining Functions: Calculating Scores from Neighbors Clustering Example: Whiskey Analytics Revisited Hierarchical Clustering Nearest Neighbors Revisited: Clustering Around Centroids Example: Clustering Business News Stories Understanding the Results of Clustering Using Supervised Learning to Generate Cluster Descriptions Stepping Back: Solving a Business Problem Versus Data Exploration Summary
7. Decision Analytic Thinking I: What Is a Good Model? Fundamental concepts: Careful consideration of what is desired from data science results; Expected value as a key evaluation framework; Consideration of appropriate comparative baselines Exemplary techniques: Various evaluation metrics; Estimating costs and benefits; Calculating expected profit; Creating baseline methods for comparison Evaluating Classifiers Plain Accuracy and Its Problems The Confusion Matrix Problems with Unbalanced Classes Problems with Unequal Costs and Benefits Generalizing Beyond Classification A Key Analytical Framework: Expected Value Using Expected Value to Frame Classifier Use Using Expected Value to Frame Classifier Evaluation Evaluation, Baseline Performance, and Implications for Investments in Data Summary
8. Visualizing Model Performance Fundamental concepts: Visualization of model performance under various kinds of uncertainty; Further consideration of what is desired from data mining results Exemplary techniques: Profit curves; Cumulative response curves; Lift curves; ROC curves Ranking Instead of Classifying Profit Curves ROC Graphs and Curves The Area Under the ROC Curve (AUC) Cumulative Response and Lift Curves Example: Performance Analytics for Churn Modeling Summary
9. Evidence and Probabilities Fundamental concepts: Explicit evidence combination with Bayes' Rule; Probabilistic reasoning via assumptions of conditional independence Exemplary techniques: Naive Bayes classification; Evidence lift Example: Targeting Online Consumers With Advertisements Combining Evidence Probabilistically Joint Probability and Independence Bayes' Rule Applying Bayes' Rule to Data Science Conditional Independence and Naive Bayes Advantages and Disadvantages of Naive Bayes A Model of Evidence "Lift" Example: Evidence Lifts from Facebook "Likes" Evidence in Action: Targeting Consumers with Ads Summary
10. Representing and Mining Text Fundamental concepts: The importance of constructing mining-friendly data representations; Representation of text for data mining Exemplary techniques: Bag of words representation; TFIDF calculation; N-grams; Stemming; Named entity extraction; Topic models Why Text Is Important Why Text Is Difficult Representation Bag of Words Term Frequency Measuring Sparseness: Inverse Document Frequency Combining Them: TFIDF Example: Jazz Musicians The Relationship of IDF to Entropy Beyond Bag of Words N-gram Sequences Named Entity Extraction Topic Models Example: Mining News Stories to Predict Stock Price Movement The Task The Data Data Preprocessing Results Summary
11. Decision Analytic Thinking II: Toward Analytical Engineering Fundamental concept: Solving business problems with data science starts with analytical engineering: designing an analytical solution, based on the data, tools, and techniques available Exemplary technique: Expected value as a framework for data science solution design Targeting the Best Prospects for a Charity Mailing The Expected Value Framework: Decomposing the Business Problem and Recomposing the Solution Pieces A Brief Digression on Selection Bias Our Churn Example Revisited with Even More Sophistication The Expected Value Framework: Structuring a More Complicated Business Problem Assessing the Influence of the Incentive From an Expected Value Decomposition to a Data Science Solution Summary
12. Other Data Science Tasks and Techniques Fundamental concepts: Our fundamental concepts as the basis of many common data science techniques; The importance of familiarity with the building blocks of data science Exemplary techniques: Association and co-occurrences; Behavior profiling; Link prediction; Data reduction; Latent information mining; Movie recommendation; Bias-variance decomposition of error; Ensembles of models; Causal reasoning from data Co-occurrences and Associations: Finding Items That Go Together Measuring Surprise: Lift and Leverage Example: Beer and Lottery Tickets Associations Among Facebook Likes Profiling: Finding Typical Behavior Link Prediction and Social Recommendation Data Reduction, Latent Information, and Movie Recommendation Bias, Variance, and Ensemble Methods Data-Driven Causal Explanation and a Viral Marketing Example Summary
13. Data Science and Business Strategy Fundamental concepts: Our principles as the basis of success for a data-driven business; Acquiring and sustaining competitive advantage via data science; The importance of careful curation of data science capability Thinking Data-Analytically, Redux Achieving Competitive Advantage with Data Science Sustaining Competitive Advantage with Data Science Formidable Historical Advantage Unique Intellectual Property Unique Intangible Collateral Assets Superior Data Scientists Superior Data Science Management Attracting and Nurturing Data Scientists and Their Teams Examine Data Science Case Studies Be Ready to Accept Creative Ideas from Any Source Be Ready to Evaluate Proposals for Data Science Projects Example Data Mining Proposal Flaws in the Big Red Proposal A Firm's Data Science Maturity
14. Conclusion The Fundamental Concepts of Data Science Applying Our Fundamental Concepts to a New Problem: Mining Mobile Device Data Changing the Way We Think about Solutions to Business Problems What Data Can't Do: Humans in the Loop, Revisited Privacy, Ethics, and Mining Data About Individuals Is There More to Data Science? Final Example: From Crowd-Sourcing to Cloud-Sourcing Final Words.
Source of Description
Print version record.
Available in Other Form
Print version: Provost, Foster, 1964- Data science for business. Sebastopol, Calif. : O'Reilly, 2013