Mastering Data Analysis in Excel Quiz Answer

This post covers the Mastering Data Analysis in Excel quiz answers for Weeks 1-6.

N.B. We do our best to keep this site up to date for our readers at no charge. You can contribute by submitting new questions or corrections to existing answers. There are many questions on the site, and it is difficult for us to check them all regularly, so any help in keeping the site current is appreciated. If you find any new questions for Mastering Data Analysis in Excel, let us know by mail or comment, and we will update the question and answer as soon as possible.

Use “Ctrl+F” to find the answer to any question. On mobile, tap the three dots in your browser to reach the “Find” option, then use it to locate any question and its answer.

To get the Mastering Data Analysis in Excel quiz answers for Weeks 1-3, please click below:

Week- 1

Click Here To View Answers

Week- 2

Click Here To View Answers

Week-3

Click Here To View Answers

Week- 4

Parametric Models for Regression (graded)

1. A University admissions test has a Gaussian distribution of test scores with a mean of 500 and standard deviation of 100. One student out-performed 97.4% of all test takers.

What was their test score (rounded to the nearest whole number)?

Hint: Refer to the Excel NormS Functions Spreadsheet.

Excel NormS Functions Spreadsheet.xlsx

694
502
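One way to check a percentile question like this outside Excel is with an inverse normal CDF, the counterpart of Excel's NORMSINV. The sketch below uses only Python's standard library; the variable names are illustrative and not part of the course material.

```python
from statistics import NormalDist

# Test scores are Gaussian with mean 500 and standard deviation 100.
scores = NormalDist(mu=500, sigma=100)

# The score that out-performs 97.4% of test takers is the 0.974 quantile.
score = scores.inv_cdf(0.974)
print(round(score))
```

The same result comes from converting the percentile to a z-score first and then computing 500 + 100 * z.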

2. A carefully machined wire comes off an assembly line within a certain tolerance. Its diameter is 100 microns, and all the wires produced have a uniform distribution of error, between -11 microns and +29 microns.

A testing machine repeatedly draws samples of 180 wires and measures the sample mean. What is the distribution of sample means?

Hint: Use the CLT and Excel Rand() Spreadsheet.

CLT and Excel Rand.xlsx

A Uniform Distribution with mean = 109 microns and standard deviation = .8607 microns.
A Uniform Distribution with mean = 109 microns and standard deviation = 11.54 microns.
A Gaussian distribution that, in Phi notation, is written, ϕ(109, 133.33).

A Gaussian Distribution that, in Phi notation, is written ϕ(109, .7407).
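The Central Limit Theorem arithmetic behind this question can be sketched as follows. It assumes the standard formulas for a continuous uniform distribution (mean at the midpoint, variance equal to the squared width divided by 12); the variable names are illustrative.

```python
import math

low, high = 100 - 11, 100 + 29   # wire diameter range in microns
n = 180                          # wires per sample

mean = (low + high) / 2                    # mean of a uniform distribution
variance = (high - low) ** 2 / 12          # variance of a uniform distribution
variance_of_mean = variance / n            # CLT: variance shrinks by a factor of n
sd_of_mean = math.sqrt(variance_of_mean)

print(mean, round(variance, 2), round(variance_of_mean, 4), round(sd_of_mean, 4))
```

Comparing these numbers with the answer options suggests that the second parameter in the Phi notation used here is a variance rather than a standard deviation.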

3. A population of people suffering from Tachycardia (occasional rapid heart rate), agrees to test a new medicine that is supposed to lower heart rate. In the population being studied, before taking any medicine the mean heart rate was 120 beats per minute, with standard deviation = 15 beats per minute.

After being given the medicine, a sample of 45 people had an average heart rate of 112 beats per minute. What is the probability that this much variation from the mean could have occurred by chance alone?

Hint: Use the Typical Problem with NormSDist Spreadsheet.

Typical Problem_ NormSDist .xlsx

.0173%

99.9827%
1.73%
29.690%
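A minimal sketch of the calculation, assuming the usual approach of converting the sample mean to a z-score with the standard error and reading off the one-sided lower tail of the standard normal distribution:

```python
import math
from statistics import NormalDist

pop_mean, pop_sd = 120, 15      # heart rate before the medicine
n, sample_mean = 45, 112        # sample taken after the medicine

standard_error = pop_sd / math.sqrt(n)
z = (sample_mean - pop_mean) / standard_error

# One-sided probability of a sample mean this low or lower by chance alone.
p = NormalDist().cdf(z)
print(f"z = {z:.3f}, p = {p:.4%}")
```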

4. Two stocks have the following expected annual returns:

Oil stock – expected return = 9% with standard deviation = 13%

IT stock – expected return = 14% with standard deviation = 25%

The stock prices have a small negative correlation: R = -.22.

What is the Covariance of the two stocks?

Hint: Use the Algebra with Gaussians Spreadsheet.

Algebra with Gaussians.xlsx

-.0286
-.00715

-.00573
-.00219
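Covariance follows directly from the correlation and the two standard deviations. A short sketch, with the percentages written as decimals:

```python
sd_oil, sd_it = 0.13, 0.25   # standard deviations of annual returns
r = -0.22                    # correlation between the two stocks

# Covariance is the correlation scaled by both standard deviations.
covariance = r * sd_oil * sd_it
print(round(covariance, 5))
```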

5. Two stocks have the following expected annual returns:

Oil stock – expected return = 9% with standard deviation = 13%

IT stock – expected return = 14% with standard deviation = 25%

The stock prices have a small negative correlation: R = -.22.

Assume return data for the two stocks is standardized so that each is represented as having mean 0 and standard deviation 1. Oil is plotted against IT on the (x,y) axis.

What is the covariance?

Hint: Use the Standardization Spreadsheet.

Standardization Spreadsheet.xlsx

-.22
-.00573

0
-1

6. Two stocks have the following expected annual returns:

Oil stock – expected return = 9% with standard deviation = 13%

IT stock – expected return = 14% with standard deviation = 25%

The stock prices have a small negative correlation: R = -.22.

What is the standard deviation of a portfolio consisting of 70% Oil and 30% IT?

Hint: Use either the Algebra with Gaussians or the Markowitz Portfolio Optimization Spreadsheet.

Algebra with Gaussians.xlsx
Markowitz Portfolio Optimization.xlsx

12.68%
10.44%
17.93%

11.79%
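The portfolio standard deviation can be sketched with the standard two-asset variance formula. The weights and inputs come from the question; the variable names are illustrative.

```python
import math

sd_oil, sd_it, r = 0.13, 0.25, -0.22
w_oil, w_it = 0.70, 0.30

cov = r * sd_oil * sd_it

# Variance of a weighted sum of two correlated returns.
variance = (w_oil * sd_oil) ** 2 + (w_it * sd_it) ** 2 + 2 * w_oil * w_it * cov
portfolio_sd = math.sqrt(variance)
print(f"{portfolio_sd:.2%}")
```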

7. Two stocks have the following expected annual returns:

Oil stock – expected return = 9% with standard deviation = 13%

IT stock – expected return = 14% with standard deviation = 25%

The stock prices have a small negative correlation: R = -.22.

Use MS Solver and the Markowitz Portfolio Optimization Spreadsheet to find the weighted portfolio of the two stocks with the lowest volatility.

Solver Add-In.xlsx
Markowitz Portfolio Optimization.xlsx
What is the minimum volatility?

10.43%

9.5%
10.36%
11.58%
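The question asks for Solver, which searches for the weights numerically. As a cross-check, the sketch below swaps in the closed-form minimum-variance weight for a two-asset portfolio, which is a different technique from Solver but should arrive at the same minimum.

```python
import math

sd_oil, sd_it, r = 0.13, 0.25, -0.22
cov = r * sd_oil * sd_it

# Closed-form minimum-variance weight for two assets (not the Solver method).
w_oil = (sd_it ** 2 - cov) / (sd_oil ** 2 + sd_it ** 2 - 2 * cov)
w_it = 1 - w_oil

variance = (w_oil * sd_oil) ** 2 + (w_it * sd_it) ** 2 + 2 * w_oil * w_it * cov
min_sd = math.sqrt(variance)
print(f"oil weight = {w_oil:.4f}, minimum volatility = {min_sd:.2%}")
```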

8. You are a data-analyst for a restaurant chain and are asked to forecast first-year revenues from new store locations. You use census tract data to develop a linear model.

Your first model has a standard deviation of model error of $25,000 at a correlation of R = .30. Your boss asks you to keep working on improving the model until the new standard deviation of model error is $15,000 or less.

What positive correlation R would you need to have a model error of $15,000?

(Note: you can answer this question by making small additions to the Correlation and Model Error spreadsheet).

Correlation and Model Error.xlsx

R = .428

R = .8200
R = .500
R = .572
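A sketch of the algebra, assuming the relationship used in the course spreadsheet is that the standard deviation of model error equals the standard deviation of the output times sqrt(1 - R^2):

```python
import math

current_error, current_r = 25_000, 0.30
target_error = 15_000

# Model error = sd_y * sqrt(1 - R^2), so first recover sd_y from the first model.
sd_y = current_error / math.sqrt(1 - current_r ** 2)

# Then solve target_error = sd_y * sqrt(1 - R^2) for R.
required_r = math.sqrt(1 - (target_error / sd_y) ** 2)
print(round(required_r, 4))
```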

9. An automobile parts manufacturer uses a linear regression model to forecast the dollar value of the next years’ orders from current customers as a function of a weighted sum of their past-years’ orders. The model error is assumed Gaussian with standard deviation of $130,000.

If the correlation is R = .33, and the point forecast orders $5.1 million, what is the probability that the customer will order more than $5.3 million?

Hint: Use the Typical Problem with NormSDist Spreadsheet.

Typical Problem_ NormSDist .xlsx

93.8%
4.3%

6.2%
12.4%
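A minimal sketch of this forecast probability. It assumes the stated $130,000 is already the standard deviation of the model error, so the correlation does not need to be applied again.

```python
from statistics import NormalDist

forecast, threshold = 5_100_000, 5_300_000
error_sd = 130_000

# The forecast error is assumed Gaussian and centered on the point forecast.
p_above = 1 - NormalDist(mu=forecast, sigma=error_sd).cdf(threshold)
print(f"{p_above:.1%}")
```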

10. An automobile parts manufacturer uses a linear regression model to forecast the dollar value of the next years’ orders from current customers as a function of a weighted sum of that customer’s past-years orders. The linear correlation is R = .33.

After standardizing the x and y data, what portion of the uncertainty about a customer’s order size is eliminated by their historical data combined with the model?

Hint: Use the Correlation and P.I.G. Spreadsheet.

Correlation and P.I.G..xlsx

4.2%
3.5%
4.5%

5.2%

11. A restaurant offers different dinner “specials” each weeknight. The mean cash register receipt per table on Wednesdays is $75.25 with standard deviation of $13.50. The restaurant experiments one Wednesday with changing the “special” from blue fish to lobster. The average amount spent by 85 customers is $77.20.

How probable is it that Wednesday receipts are better than average by chance alone?

Hint: Use the Typical Problem with NormSDist Spreadsheet.

Typical Problem_ NormSDist .xlsx

9.15%

9.05%
90.85%
8.30%

12. Your company currently has no way to predict how long visitors will spend on the Company’s web site. All that is known is that the average time spent is 55 seconds, with an approximately Gaussian distribution and standard deviation of 9 seconds. It would be possible, after investing some time and money in analytics tools, to gather and analyze information about visitors and build a linear predictive model with a standard deviation of model error of 4 seconds.

How much would the P.I.G. of that model be?

Hint: Use the Correlation and P.I.G. Spreadsheet

How to use the AUC calculator.pdf

48.2%
61.5%

53.3%
57.2%

Week- 5

Probability, AUC, and Excel Linest Function

1. Keep the 125 outcomes in the Histogram Spreadsheet unchanged. Change the bin ranges so that bin 1 is [-3, -1), bin 2 is [-1, 1), and bin 3 is [1, 3).

Histograms Spreadsheet.xlsx
What is the approximate probability that a new outcome will fall within bin 1?

2. Use the Excel Probability Functions Spreadsheet.

Excel_Probability_Functions.xlsx
Assume a continuous uniform probability distribution over the range [47, 51.5].

What is the skewness of the probability distribution?

49.25
1.69

2.17
0

3. Use the Excel Probability Functions Spreadsheet, provided in question #2.

Assume a continuous uniform probability distribution over the range [-12, 20]

What is the entropy of this distribution?

5 bits
3 bits
6 bits

4 bits
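For a continuous uniform distribution, the entropy in bits is the base-2 logarithm of the width of the range. A short sketch:

```python
import math

low, high = -12, 20

# Entropy of a continuous uniform distribution is log2 of its width.
entropy_bits = math.log2(high - low)
print(entropy_bits)
```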

4. Use the Excel Probability Functions Spreadsheet that was previously provided.

Assume a Gaussian Probability function with mean = 3 and standard deviation = 4.

What is the value of f(x) at x = 3.5?

4.05
.352
.099

.550

5. Use the Excel Probability Functions Spreadsheet previously provided in this quiz.

Assume a Gaussian Probability Distribution with mean = 3 and standard deviation = 4.

What is the cumulative distribution at x = 7?

.960
.841
.060
1.00
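The density and cumulative values asked for in questions 4 and 5 can be checked with Python's standard library, as a stand-in for the spreadsheet functions:

```python
from statistics import NormalDist

dist = NormalDist(mu=3, sigma=4)

density = dist.pdf(3.5)     # height of the density curve at x = 3.5
cumulative = dist.cdf(7)    # probability of a value at or below x = 7
print(round(density, 3), round(cumulative, 3))
```

Since x = 7 is exactly one standard deviation above the mean, the cumulative value is the familiar standard normal value at z = 1.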

6. Use the AUC Calculator Spreadsheet.

AUC_Calculator and Review of AUC Curve.xlsx
If the “modification factor” in the original example given in the AUC Calculator Spreadsheet is changed from -1 to -2, what is the change in the actual Area Under the ROC Curve?

No change
The area increases
The area decreases

7. Use the AUC Calculator Spreadsheet provided in question #6.

If the “modification factor” in the original example given in the AUC Calculator Spreadsheet is changed from -1 to -2, what is the threshold (row 10) that results in the lowest cost per event?

.45
3.5
.9
1.3

8. Refer to the AUC Calculator Spreadsheet previously provided.

Assume a binary classification model is trained on 200 ordered pairs of scores and outcomes and has an AUC of .91 on this “training set.” The same model, on 5,000 new scores and outcomes, has an AUC of .5.

Which statement is most likely to be correct?

The model overfit the training set data and will need to be improved to work better on the new data.
The original model is expected to perform worse on test set data and is functioning acceptably.
The original model identified signal as noise and has no predictive value on new data.

9. Refer to the Excel Linest Function Spreadsheet.

Excel Linest Function.xlsx
If a multivariate linear regression gives a weight beta(1) of 0.4 on x(1) = “age in years,” and a new input x(7) of “age in months” is added to the regression data, which of the following statements is false?

If the x(1) data are removed, the new beta(7) on the new x(7) data will be .033
Using Excel linest, and including x(1) and x(7) data, the new beta(7) on the age in months will be 0.
If the x(1) data are removed, the new beta(7) on the new x(7) data will be 0.4.

10. Use the Excel Linest Function Spreadsheet that was provided in question #9.

What is the Correlation, R for the linear regression shown in the example?

.367
.606
.778 or – .778

Week- 6

Part 1: Building your Own Binary Classification Model

1. First Binary Classification Model

Data_Final Project.xlsx
You work for a bank as a business data analyst in the credit card risk-modeling department. Your bank conducted a bold experiment three years ago: for a single day it quietly issued credit cards to everyone who applied, regardless of their credit risk, until the bank had issued 600 cards without screening applicants.

After three years, 150, or 25%, of those card recipients defaulted: they failed to pay back at least some of the money they owed. However, the bank collected very valuable proprietary data that it can now use to optimize its future card-issuing process.

The bank initially collected six pieces of data about each person:

· Age

· Years at current employer

· Years at current address

· Income over the past year

· Current credit card debt, and

· Current automobile debt

In addition, the bank now has a binary outcome: default = 1, and no default = 0.

Your first assignment is to analyze the data and create a binary classification model to forecast future defaults.

You will combine data from the above six inputs to output a single “score.” Use the Soldier Performance spreadsheet for a simple example of combining multiple inputs.

Forecasting Soldier Performance.xlsx
The relative rank-ordering of scores will determine the model’s effectiveness. For convenience, and in particular so that you can use the AUC Calculator Spreadsheet, you are asked to use a scale for your score that has a maximum < 3.5 and a minimum > -3.5.

At first you are not told what your bank’s own best estimate is for its cost per False Negative (accepted applicant who becomes a defaulting customer) and False Positive (rejected customer who would not have defaulted) classification.

Therefore, the best you can do is to design your model to maximize the Area Under the ROC Curve, or AUC.

You are told that if your model is effective (“high enough” AUC, not defined further) and “robust” (again not defined, but in general this means relatively little decrease in AUC across multiple sets of new data) then it may be adopted by the bank as its predictive model for default, to determine which future applicants will be issued credit cards.

You are first given a “Training Set” of 200 out of the 600 people in the experiment. The Data_For_Final_Project (below) has both the training set and test set you will need.

Design your model using the Training Set. Standardized versions of the input data are also provided for your convenience. You may combine the six inputs by adding them to, or subtracting them from, each other, taking simple ratios, etc. Exclude inputs that are not helpful and then experiment with how to combine the most informative inputs.
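As a rough illustration of combining inputs into a score and measuring its AUC, the sketch below uses the rank-comparison definition of AUC: the probability that a randomly chosen defaulter scores higher than a randomly chosen non-defaulter. The five applicants and the score formula are hypothetical and are not taken from the course data or the AUC Calculator Spreadsheet.

```python
def auc(scores, outcomes):
    """Probability that a random defaulter outscores a random non-defaulter."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))

# Hypothetical standardized inputs for five applicants (not the course data).
income = [-1.2, 0.3, 1.1, -0.4, 0.8]
card_debt = [0.9, -0.5, 0.2, 1.4, -1.0]
defaulted = [1, 0, 0, 1, 0]

# Example score: more debt relative to income is treated as higher risk.
score = [d - i for d, i in zip(card_debt, income)]
print(auc(score, defaulted))
```

Only the rank order of the scores matters to this measure, which is why rescaling a score to fit the -3.5 to 3.5 range does not change its AUC.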

Note that you will need some of your quiz answers again later, so please write them down and keep track of them as you go along.

Question: What is your model? Give it as a function of the two or more of the six inputs. For example: (Age + Years at Current Address)/Income [not a great model!].

Your model should have at least two inputs.

2. What is your model’s AUC on the Training Set? Use two digits to the right of the decimal place.

Enter answer here

.70

3. Initial Assessment for Over-fitting (testing your model on new data)

Next test your model, without changing any parameters, on the Test Set of 200 additional applicants. See the Test Set spreadsheet. It is part of the Data_For_Final_Project (below) and has both the training and test set.

Data_Final Project.xlsx
Hint: Make and use a second copy of the AUC Calculator Spreadsheet so that you can compare Test Set and Training Set results easily.

AUC_Calculator and Review of AUC Curve.xlsx
What is your model’s new AUC on the Test Set? Give two digits to the right of the decimal place.

Enter answer here

0.80

4. Finding the Cost-Minimizing Threshold for your Model

Now that you have, hopefully, developed your model to the point where it is relatively “robust” across the training set and test set, your boss at the bank finally gives you its current rough estimate of the bank’s average costs for each type of classification error.

[Note that all bank models here include only profits and losses within three years of when a card is issued, so the impact of out-years (years beyond 3) can be ignored.]

Cost Per False Negative: $5000

Cost Per False Positive: $2500

For the 600 individuals that were automatically given cards without being classified, the total cost of the experiment turned out to be 25%*($5000)*600 or $750,000. This is $1,250 per event.

Only models with lower cost per event than $1,250 should have any value.

Question: What is the threshold score on the Training Set data for your model that minimizes Cost per Event? You will need this number to answer later questions.

Hint: Using the AUC Calculator Spreadsheet, identify which Column displays the same cost-per-event (row 17) as the overall minimum cost-per-event shown in Cell J2. The threshold is shown in row 10 of that Column. What the threshold means is that at and above this number everything is classified as a “default.”
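The threshold search described in the hint can be sketched in code. This is an illustration of the idea, not a reproduction of the AUC Calculator Spreadsheet: the scores, outcomes, and candidate grid are hypothetical, while the two error costs and the $1,250 baseline come from the question.

```python
COST_FN = 5000   # accepted applicant who later defaults
COST_FP = 2500   # rejected applicant who would not have defaulted

def cost_per_event(scores, outcomes, threshold):
    """Average cost when every score at or above threshold is called a default."""
    total = 0
    for s, y in zip(scores, outcomes):
        predicted_default = s >= threshold
        if y == 1 and not predicted_default:
            total += COST_FN
        elif y == 0 and predicted_default:
            total += COST_FP
    return total / len(scores)

def best_threshold(scores, outcomes, candidates):
    """Return the (threshold, cost) pair with the lowest cost per event."""
    return min(((t, cost_per_event(scores, outcomes, t)) for t in candidates),
               key=lambda pair: pair[1])

# Hypothetical scores and outcomes (not the course data).
scores = [2.1, -0.8, -0.9, 1.8, -1.8, 0.4, 1.2, -0.2]
outcomes = [1, 0, 0, 1, 0, 0, 1, 0]
candidates = [x / 10 for x in range(-35, 36)]   # -3.5 to 3.5 in steps of 0.1

baseline = 0.25 * COST_FN   # issuing cards to everyone: $1,250 per event
threshold, cost = best_threshold(scores, outcomes, candidates)
print(threshold, cost, baseline)
```

A model only has value if its best cost per event comes in below the baseline.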

AUC_Calculator and Review of AUC Curve.xlsx

Enter answer here

3.5

5. Finding the Minimum Cost Per Event

Question: Again referring only to the Training Set data, what is the overall minimum cost-per-event?

Hint: You will need this number to answer later questions. If you used the AUC Calculator, the overall minimum cost per event will be displayed in Cell J2.

Note: for Coursera to interpret your answer correctly you must give your answer as an integer – no decimals or dollar sign.

For Example – enter $800.00 as “800”

Enter answer here

600

6. Comparing the New Minimum Cost Per Event on Test Set Data

When you compared AUC for the Training and Test Sets, all that was necessary was to look up the two different values in Cell G8. But to get an accurate measure of the cost-savings using the original model on new data, you cannot automatically use the new threshold that results in the overall lowest cost-per-event on the Test Set.

Remember that your model is being tested for its ability to forecast – but the new optimal threshold will be known only after the outcomes for the entire Test Set are known.

All you can use is the model you developed on the Training Set data and the threshold from the Training Set that you should have recorded when answering Question 4.

Question: At that same threshold score (NOT the threshold score that would minimize costs for the new Test Set, but the “old” threshold score that minimized costs on the Training Set) what is the cost per event on the test set?

Hint: Using the AUC Calculator Spreadsheet previously provided, locate the column on the Training Set data that has the lowest cost-per-event. That same column and threshold in the Test Set copy of the AUC Calculator will have a new cost-per-event, displayed in row 17. This is almost always higher than the minimum cost-per-event on the Training Set, and also higher than what the minimal cost-per-event would be on the Test Set, if one could know the new optimal threshold in advance. This number is the actual cost per event when applying the model-and-threshold developed with the Training Set to the new, Test Set data.

Note: for Coursera to interpret your answer correctly you must give your answer as an integer – no decimals or dollar sign.

For Example – enter $800.00 as “800”

Enter answer here

700.00

7. Putting a Dollar Value on Your Model Plus the Data

Assume your Test Set cost-per-event results from Question 6 are sustainable long term.

Question: How much money does the bank save, per event, using your model and its data-inputs, instead of issuing credit cards to everyone who asks?

Hint: the cost of issuing credit cards to everyone (no model, no forecast) has been determined to be 25%*$5000 = $1,250 per event. Dollar value of the model-plus-data is the difference between $1,250 and your number.

Note: for Coursera to interpret your answer correctly you must give your answer as an integer – no decimals or dollar sign.

For Example – enter $800.00 as “800”

Enter answer here

100

8. Payback Period for Your Model

Question: Given that it apparently cost the bank $750,000 to conduct the three-year experiment, if the bank processes 1000 credit card applicants per day on average, how many days will it take to ensure future savings will pay back the bank’s initial investment?

Give number rounded to the nearest day (integer value).

Hint: multiply your answer to Question 7 – the cost savings per applicant – by 1000 to get the savings per day.
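The payback arithmetic in the hint can be sketched as below. The $1,100 cost per event is a hypothetical input chosen for illustration, not an answer from the course; the other figures come from the question.

```python
EXPERIMENT_COST = 750_000
BASELINE_COST_PER_EVENT = 1250
APPLICANTS_PER_DAY = 1000

def payback_days(model_cost_per_event):
    """Days of savings needed to recover the cost of the experiment."""
    savings_per_event = BASELINE_COST_PER_EVENT - model_cost_per_event
    savings_per_day = savings_per_event * APPLICANTS_PER_DAY
    return round(EXPERIMENT_COST / savings_per_day)

# Hypothetical Test Set cost per event of $1,100 (not a course answer).
print(payback_days(1100))
```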

Enter answer here

9. Any model that is reducing uncertainty will have a True Positive Rate…

…Less than the Test Incidence (% of outcomes classified as “default”)
…Equal to the Test Incidence (% of outcomes classified as “default”)
…Greater than the Test Incidence (% of outcomes classified as “default”)

10. Given that the base rate of default in the population is 25%, any test that is reducing uncertainty will have a Positive Predictive Value (PPV)…

…Less than .25
…Greater than .25
…Equal to .25

11. Given that the base rate of default in the population is 25%, any test that is reducing uncertainty will have a Negative Predictive Value (NPV)…

…Less than .75
…Greater than .75
…Equal to .75

12. Confusion Matrix Metrics. To determine all performance metrics for a binary classification, it is sufficient to have three values:

The Condition Incidence (here the default rate of 25%)
The probability of True Positives (the True Positive rate multiplied by the Condition Incidence)
The “Test Incidence” (also called “classification incidence” – the sum of the probability of True Positives and False Positives)
These three values can all be obtained from the AUC Calculator Spreadsheet and then used as inputs to the Information Gain Calculator Spreadsheet to determine all other performance metrics.
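To illustrate why three values are enough, the sketch below derives the whole confusion matrix, as probabilities, from the Condition Incidence, the probability of True Positives, and the Test Incidence. The function name and the example model (True Positive Rate of 0.60, Test Incidence of 0.30) are hypothetical and do not come from the course spreadsheets.

```python
def confusion_metrics(condition_incidence, p_true_positive, test_incidence):
    """Derive the full confusion matrix (as probabilities) from three inputs."""
    p_tp = p_true_positive
    p_fp = test_incidence - p_tp          # classified default, did not default
    p_fn = condition_incidence - p_tp     # defaulted, not classified default
    p_tn = 1 - p_tp - p_fp - p_fn
    return {
        "TP": p_tp, "FP": p_fp, "FN": p_fn, "TN": p_tn,
        "TPR": p_tp / condition_incidence,
        "FPR": p_fp / (1 - condition_incidence),
        "PPV": p_tp / test_incidence,
        "NPV": p_tn / (1 - test_incidence),
    }

# Hypothetical model: 25% default rate, TPR of 0.60, 30% classified as default.
m = confusion_metrics(0.25, 0.60 * 0.25, 0.30)
for name, value in m.items():
    print(name, round(value, 4))
```

In this hypothetical example the True Positive Rate exceeds the Test Incidence, the PPV exceeds the 25% base rate, and the NPV exceeds 75%.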

AUC_Calculator and Review of AUC Curve.xlsx
Information Gain Calculator.xlsx
Question: What is your model’s True Positive Rate?

Save this answer as it will be needed again for Part 3 (Quiz 3)

Enter answer here

.30

13. Question: What is your model’s “test incidence”?

Save this answer as it will be needed again for Part 3 (Quiz 3)

Enter answer here

.20

Part 2: Should the Bank Buy Third-Party Credit Information?

1. Introduction
Part 2 is intended to illustrate how binary classification performance metrics make it possible for you to put an exact value, in dollars per event, on new information that relates to a predictive model.

Note that new information will be worth far more if it is compared to no forecasting model rather than the state of partial knowledge available from the current model. Sellers of information (and data science consultants!) love to take credit for any information gain they achieve over the base rate.

Very often some intermediate state of knowledge is already available for which no additional spending is required. Evaluating the realistic incremental financial gain from new information, whether licensing a third-party commercial database or collecting new data internally, is therefore of great practical value, as this sets an upper bound on what your Company should be willing to pay to license or create the new information.

In this case study, your boss has been in discussions with an advanced machine-learning predictive-analytics credit-risk analytics company that claims to score individual probability of default with very high information gain. Let’s call the company Eggertopia. Eggertopia sales representatives claim their pre-processed risk-scores can achieve AUC values as high as .85 or even higher. However, Eggertopia scores are sold per-event, and they are expensive!

Your boss asks you to determine the incremental financial value to the bank of purchasing Eggertopia risk scores on future credit-card applicants.

Eggertopia agrees to apply its algorithms to generate credit scores for the 400 individuals in the Training and Test Sets. Eggertopia scores do not need to be combined with anything else to make a model. However, since the scores range from approximately -600 (best credit risk) to 4900 (most likely to default) they will need to be standardized and adjusted to fit the -3.5 to 3.5 range of the AUC Calculator Spreadsheet (below)

AUC_Calculator and Review of AUC Curve.xlsx
You will determine the sustainable AUC of the Eggertopia scores, the sustainable cost-per-event, and the savings per event, when comparing Eggertopia data to the base rate forecast.

You will then calculate the incremental savings per event if you compare use of Eggertopia data to use of your current model developed in Part 1.
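One possible way to standardize raw scores and fit them inside the -3.5 to 3.5 range is sketched below. The shrink step and the 0.99 safety margin are assumptions made for illustration, since the exact adjustment is not spelled out here, and the raw scores are hypothetical.

```python
from statistics import mean, pstdev

def fit_to_range(raw_scores, limit=3.5):
    """Standardize scores, then shrink them if needed so all lie inside (-limit, limit)."""
    mu, sigma = mean(raw_scores), pstdev(raw_scores)
    z = [(s - mu) / sigma for s in raw_scores]
    largest = max(abs(v) for v in z)
    if largest >= limit:
        shrink = 0.99 * limit / largest   # scaling leaves the rank order unchanged
        z = [v * shrink for v in z]
    return z

# Hypothetical raw scores on the vendor's scale (not the course data).
raw = [-600, -250, 0, 40, 75, 120, 160, 210, 260, 300, 350, 400, 450, 500, 4900]
adjusted = fit_to_range(raw)
print([round(v, 2) for v in adjusted])
```

Because both steps are increasing linear transformations, the rank order of applicants, and therefore the AUC, stays the same.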

Question: What is the AUC of the Eggertopia Scores on the Training Set? Give your answer to two digits to the right of the decimal point.

.83
.85
.88
.95

2. What is the optimum threshold on the training set to minimize the average cost per test?

3. What is the average cost-per-event at the Training Set optimum threshold?

$640
$600
$500
$540

4. What is the AUC of the Eggertopia scores on the Test Set?

.85
.88
.80
.75

5. Using the same threshold as used on the training set, what is the cost per event of the Eggertopia scores on the Test Set? Round to the nearest dollar.

$838
$803
$833
$823

6. If the bank did not have your model, or any other way of forecasting default, what is the maximum (break-even) price per event that the bank could theoretically pay for Eggertopia scores? In other words, what are Eggertopia’s scores’ absolute savings-per-event?

Hint: Calculate the difference between the cost-per-event at a 25% default rate, and the cost-per-event using Eggertopia scores

$423
$425
$412
$418

7. What is the True Positive rate of the forecasting model using Eggertopia Scores?

.70
.72
.76
.74

8. What is the Positive Predictive Value (PPV) of the forecasting model using Eggertopia scores?

Hint: To calculate the PPV, divide the portion of True Positives by the total number of Positive Classifications. Review confusion matrix definitions and letter designations on the Information Gain Spreadsheet, [PPV is defined at Cell G41], obtain True Positive and False Positive Rates from the AUC Calculator Spreadsheet, and use algebra to solve.

Information Gain Calculator.xlsx

.52
.50
.48
.54

9. Incremental Financial Value of Eggertopia Scores

You calculated a cost per event for your own predictive model on Test Set data to answer Quiz 1 – Part 1, Question 6.

Question: Assuming that the performance of the Eggertopia model and your model both remain stable on any future data (a big assumption), what is the maximum, or break-even, price that the bank could pay per score for Eggertopia, given that it already has your model and data?

Part 3: Comparing the Information Gain of Alternative Data and Models

1. Comparing the Information Gain of Eggertopia Scores and Your Model
Both the Eggertopia Scores and your binary classification model can be thought of as tools to reduce uncertainty about future default outcomes of credit card applicants.

Your own model, developed in Part 1, identifies dependencies between, on the one hand, the six types of input data collected by the bank, and on the other hand, the binary outcome default/no default.

If we assume that the dependencies identified by Eggertopia Scores and by your model on the Test Set are stable and representative of all future data (a big assumption) we can draw some further conclusions about how much information gain, or reduction in uncertainty, is provided by each.

Definitions are given in the Information Gain Calculator Spreadsheet, provided below.

Information Gain Calculator.xlsx
Question: On your model’s Test Set results, what is the conditional entropy of default, given your test classifications?

Hint: you need your model’s true positive rate from Part 1, Question 12, and “test incidence” [proportion of events your model classifies as default] from Part 1, question 13. Use the condition incidence of 25% and your model’s True Positive rate to calculate the portion of TPs. Then you have the inputs needed to use the Information Gain Calculator Spreadsheet.

2. Recall that the entropy of the original base rate, minus the conditional entropy of default given your test classification, equals the Mutual Information between default and the test.

I(X;Y) = H(X) – H(X|Y).

The population of potential credit card customers consists of 25% future defaulters. The base rate incidence of default (.25, .75) has an uncertainty, or entropy, of H(.25, .75) = .25*log4 + .75*log1.333 = .8113 bits.
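The entropy, conditional entropy, and Mutual Information described above can be sketched in code. The base-rate entropy uses the figures from the text, with logarithms taken base 2; the example model (True Positive Rate of 0.60, Test Incidence of 0.30) is hypothetical, and the function names are not from the Information Gain Calculator Spreadsheet.

```python
import math

def entropy(*probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(condition_incidence, p_true_positive, test_incidence):
    """Return H(X), H(X|Y), and I(X;Y) for a binary outcome X and binary test Y."""
    p_tp = p_true_positive
    p_fp = test_incidence - p_tp
    p_fn = condition_incidence - p_tp
    p_tn = 1 - p_tp - p_fp - p_fn
    h_x = entropy(condition_incidence, 1 - condition_incidence)
    # H(X|Y): entropy of the outcome within each classification, weighted by
    # how often that classification occurs.
    h_given_pos = entropy(p_tp / test_incidence, p_fp / test_incidence)
    h_given_neg = entropy(p_fn / (1 - test_incidence), p_tn / (1 - test_incidence))
    h_x_given_y = test_incidence * h_given_pos + (1 - test_incidence) * h_given_neg
    return h_x, h_x_given_y, h_x - h_x_given_y

# Hypothetical model: TPR of 0.60 and 30% of applicants classified as default.
h_x, h_cond, info_gain = mutual_information(0.25, 0.60 * 0.25, 0.30)
pig = info_gain / h_x   # Percentage Information Gain
print(round(h_x, 4), round(h_cond, 4), round(info_gain, 4), f"{pig:.2%}")
```

Dividing a savings-per-event figure by the Mutual Information then gives the savings per bit.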

Question: On your test set results, what is the Mutual Information, or information Gain, in average bits per event?

3. Recall that Percentage Information Gain (P.I.G.) is the ratio of I(X;Y)/H(X).

Question: on your Test Set results, what is the Percentage Information Gain (P.I.G.) of your model?

4. Since you have, for your model on the Test Set, a savings-per-event and a bits-per-event (Mutual Information), you can calculate a savings-per-bit. This is a powerful concept, because it places a financial value directly on the information content of a model (or additional data source, like the Eggertopia scores).

Question: How many dollars does the bank save, for every bit of information gain achieved by your model?

5. Information Gain of Eggertopia Scores over the Base Rate

For questions in this section, assume your model and the data it uses are not available – the bank’s choice is between Eggertopia scores and the base rate.

Question: What is the Mutual Information of the Eggertopia Scores?

In other words, on the Test Set, What is the information gain, in average bits per event, over the base rate of (.25, .75) offered by the Eggertopia Scores?

.1305 bits per event
.1255 bits per event
.1243 bits per event
.1205 bits per event

6. On the test set, what is the Eggertopia scores’ Percentage Information Gain (PIG)?

14.85%
15.35%
15.25%
13.95%

7. If Eggertopia data were free, and your model was unavailable, what would the dollar savings per bit of information extracted be?

Dollar savings are $412, rounded to the nearest dollar, from Quiz 2, Question 6.

Value would be $427 per bit.
Value would be $3,427 per bit.
Value would be $3,627 per bit.

8. Incremental Information Gain of Eggertopia Scores Compared to Your Model and Available Data (any answer scores)

(For this section, assume your Model and the Data it uses are available).

Question: What is the incremental information gain of the Eggertopia scores, over your model from Part 1, in average bits per event, if any?

9. What is the maximum (break-even) price the bank should pay for Eggertopia scores, per score, if your model from Part 1 and data are already available?

10. At the above maximum (break-even) price per score, what would be the value per bit of incremental information gained from the Eggertopia scores? Give your answer in $/bit.

Part 4: Modeling Profitability Instead of Default

1. Modeling Profitability Instead of Default

Modeling Profitability Level as a Continuous Output (Instead of Binary Classification Default/No Default)

Introduction

Both your own model and the forecast based on Eggertopia scores are binary classifications: they forecast one of just two outcomes: “Default” or “No Default.” Your boss is interested in the idea that it might be preferable instead to model and forecast profits and losses as continuous values, using a multivariate linear regression model on the same six input variables. This idea has arisen because the bank has been reviewing individual profit and loss numbers for each customer over the three-year period and has made an interesting discovery: some defaulting customers carried so much debt for so long, and paid so much interest on it, that they were profitable for the bank even though they defaulted! Many customers who seem to have risky spending behaviors are also among the most profitable for a lending business. And, at the opposite extreme, customers who always paid off their cards in full each month never defaulted but were not very profitable: the bank barely broke even, or even lost money, on its “safest” borrowers.

Your boss asks you to forecast each applicant’s expected profitability, in dollars, before deciding whether or not to issue them a credit card. He wants to know how reliable this type of forecast would be: what is the range above and below the point estimate that will be correct 90% of the time?

Although it might be possible to combine the six inputs in other ways, in the interests of time and focusing on the key learning objectives, we will use only a simple linear combination of the six input variables for Part 4 of this Project. (You should not include the Eggertopia Scores as an input variable).

Question 1 is about the coefficients or “betas” used to combine the standardized inputs to get the best-fit-line on standardized outputs on the Training Set. We then use those fixed betas to measure the observed residual error of the model on the Test Set.

Questions 2 through 6 concern the forecasts on the Test Set.

Questions 7 through 11 look at the Training Set results so that they can be compared (for possible over-fitting) against the Test Set Results.

Questions 12 through 14 are about the uncertainty that remains in a new individual forecast of profitability.

Use the Excel “Linest” function on the six inputs and profitability output on the 200 Training Set applicants to calculate the coefficients (the “betas”) that result in the best-fit line.
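
For readers who want to cross-check the spreadsheet outside Excel, here is a minimal sketch of the same least-squares fit in Python with NumPy. It assumes inputs and output are standardized with the population standard deviation (the course workbook may use the sample version), the function names are illustrative, and the data are synthetic stand-ins rather than the course Training Set.

```python
import numpy as np

def standardize(values):
    """Rescale each column to zero mean and unit (population) standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean(axis=0)) / values.std(axis=0)

def fit_betas(inputs, output):
    """Least-squares coefficients ("betas") on standardized inputs and output."""
    z_inputs = standardize(inputs)
    z_output = standardize(output)
    # No intercept column is needed: standardized data has zero mean.
    betas, *_ = np.linalg.lstsq(z_inputs, z_output, rcond=None)
    return betas

# Synthetic stand-in for 200 applicants with six inputs (NOT the course data).
rng = np.random.default_rng(0)
demo_inputs = rng.normal(size=(200, 6))
demo_weights = np.array([0.0, 0.2, -0.1, 0.6, -0.1, 0.0])  # made-up values
demo_output = demo_inputs @ demo_weights + rng.normal(scale=0.5, size=200)

demo_betas = fit_betas(demo_inputs, demo_output)
print(np.round(demo_betas, 2))
```

When comparing against Excel, keep in mind that LINEST lists its coefficients in reverse column order, with the intercept last.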

1. Question: Do you feel prepared to take this quiz?

2. Question: What are your values for each “beta” on the Training Set?

Age
Years at current employer
Years at current address
Income over the past year
Current credit card debt
Current automobile debt

.01, .19, -.07, .64, -.06, 0
01, -.19, -.07, -.64, -.06, 0
.01, .19, .07, .64, .06, 0

3. For this question, use the Linear Regression Forecasting explanation and Excel spreadsheet.

Question: What is the root-mean-square residual (the standard deviation of model error) on Standardized output for the Test Set?

.5835
.8109
0.6750
.6875
.3250
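
As a rough illustration of what the root-mean-square residual measures, the sketch below applies fixed betas to standardized Test Set rows and takes the RMS of actual minus forecast. The function name and the tiny data set are made up for illustration; how the Test Set is standardized (with Training Set or Test Set means) is left to the course spreadsheet.

```python
import numpy as np

def rms_residual(z_inputs, z_output, betas):
    """Root-mean-square of (actual - forecast) on standardized data."""
    forecast = np.asarray(z_inputs, dtype=float) @ np.asarray(betas, dtype=float)
    residuals = np.asarray(z_output, dtype=float) - forecast
    return float(np.sqrt(np.mean(residuals ** 2)))

# Made-up example: two inputs, three rows (illustrative only).
demo_rms = rms_residual(
    [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]],  # standardized inputs
    [0.5, 0.5, -1.5],                        # standardized actual output
    [0.5, 0.5],                              # betas fixed from training
)
print(round(demo_rms, 4))
```

The RMS residual equals the standard deviation of the errors only when the mean error is close to zero.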

4. For this question, use the Linear Regression Forecasting Explanation and Spreadsheet.

Question: What is the observed correlation R on the Test Set?

0.7378
.8095
.7590
.7332
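
The following sketch assumes R is the Pearson correlation between forecast and actual output. It also shows the link between R and the standardized error: for a least-squares fit evaluated on the data it was fitted to, the standard deviation of error on standardized output is sqrt(1 - R^2). On a separate Test Set that relation holds only approximately. The function names are illustrative.

```python
import numpy as np

def correlation_r(forecast, actual):
    """Pearson correlation between forecast and actual values."""
    return float(np.corrcoef(forecast, actual)[0, 1])

def error_sd_from_r(r):
    """Standardized error sd implied by R for an in-sample least-squares fit."""
    return float(np.sqrt(1.0 - r ** 2))

# Made-up value of R, for illustration only.
demo_error_sd = error_sd_from_r(0.6)
print(round(demo_error_sd, 4))
```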

5. For this question, use the Linear Regression Forecasting explanation and Excel spreadsheet.

Question: What is the Standard deviation of model error, in Dollars, for the Test Set?

$3,996.81
$3,411.80
$3,885.14
$3,379.36
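
Converting a standardized error into dollars is a single multiplication by the standard deviation of the output. The sketch below assumes the conversion uses the Training Set profitability standard deviation of $5,755.91 and the standardized Test Set error of .6750, both quoted later in this quiz. Because .6750 is itself rounded, the result can differ by a few cents from a spreadsheet that carries more decimals.

```python
def error_sd_in_dollars(standardized_error_sd, output_sd_dollars):
    """Scale a standardized error sd back to the units of the output."""
    return standardized_error_sd * output_sd_dollars

demo_sd_dollars = error_sd_in_dollars(0.6750, 5755.91)
print(f"${demo_sd_dollars:,.2f}")
```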

6. For this question, use the Linear Regression Forecasting explanation and Excel spreadsheet:

Question: What is the 90% confidence interval, in dollars, for the Test Set?

$6,390.49 above the point estimate, and $6,390.49 below the point estimate
$5,611.91 above the point estimate, and $5,611.91 below the point estimate
$6,574.17 above the point estimate, and $6,574.17 below the point estimate
$5,558.55 above the point estimate, and $5,558.55 below the point estimate
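
Under the Gaussian assumption, the distance above and below the point estimate is the error standard deviation times a z-multiplier. The sketch below is one way to compute it with Python's standard library; it mirrors Excel's `NORM.S.INV` applied to the upper tail cutoff. The function name and the demo value are illustrative.

```python
from statistics import NormalDist

def interval_half_width(error_sd, confidence=0.90):
    """Half-width of a two-sided Gaussian interval at the given confidence."""
    tail = (1.0 - confidence) / 2.0          # probability left in each tail
    z = NormalDist().inv_cdf(1.0 - tail)     # e.g. about 1.645 for 90%
    return z * error_sd

# With an error sd of 1.0 the half-width is just the z-multiplier.
demo_half_width = interval_half_width(1.0, confidence=0.90)
print(round(demo_half_width, 3))
```

Other confidence levels, such as 50% or 99%, only change the `confidence` argument.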

7. What is the Percentage Information Gain (P.I.G.) on the Test Set?

27.7%
18.9%
26.4%
37.2%

8. For this question, use the Linear Regression Forecasting explanation and Excel spreadsheet:

Question: What is the Correlation, R, of your model on the Training Set?

.7505
.7805
.8095

9. For this question, use the Linear Regression Forecasting explanation and Excel spreadsheet:

You need to quantify the uncertainty in a regression model forecast of applicants’ future profitability. Assume that both the forecast profits and the errors have a Gaussian distribution. You will calculate the standard deviation of model error on standardized data, the standard deviation in dollars of the model error, and the 90% confidence interval for profitability estimates.

Question: What is the standard deviation of your model error on the standardized Training Set output?

.587
.487
-.487
-.587

10. For this question, use the Linear Regression Forecasting explanation and Excel spreadsheet.

Question: What is the standard deviation of model error in dollars on the Training Set?

Note: This may seem similar to question 5, but Q5 refers to the Test Set.

$3,379.36
$4,379.36
$5,500.87
$4,312.91

11. For this question, use the Linear Regression Forecasting explanation and Excel spreadsheet.

Question: What is the 90% confidence interval, in dollars, on the Training Set?

Note: This may seem similar to question 6, but Q6 refers to the Test Set.

$5,558.55
$6,211.18
$5,328.93
$7,128.55

12. For this question, use the Linear Regression Forecasting explanation and Excel spreadsheet.

Question: What is the Percentage Information Gain (P.I.G.) on the Training Set?

Note: This may seem similar to question 7, but Q7 refers to the Test Set.

36.5%
37.5%
41.4%
32.4%

13. Questions 13 through 15 use the same example applicant.

The following data are known about the sample applicant:

Age: 42.00

Years at Employer: 12.44

Years at Address: 0.9

Income: $121,400

CC debt: -34,228

Auto debt: -23,411

To convert the above inputs to standardized form, locate the Training Set Spreadsheet (first bottom tab of the workbook) in the Data for Final Project Workbook.

Data_for_Final_Project.xlsx
Use the input means [Cells C207:H207] and standard deviations [Cells C209:H209].

Use the Training Set profitability mean [$1,905.51] and standard deviation [$5,755.91] from the Profit and Loss (last bottom tab) Spreadsheet.

Use the Test Set standard deviation of error on standardized outputs of .6750.

Question: What is the point estimate of profitability, in dollars?

$10,683.61
$11,109.61
$8,451.61
-$10,683.61
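
The steps described in this question (standardize the inputs, apply the betas, convert back to dollars) can be sketched as follows. The applicant values and the profitability mean and standard deviation come from the question. The input means, input standard deviations, and betas below are placeholders, not the workbook values, so the printed number is not the quiz answer.

```python
import numpy as np

def point_estimate_dollars(raw_inputs, input_means, input_sds, betas,
                           output_mean, output_sd):
    """Standardize the inputs, apply the betas, convert back to dollars."""
    z_inputs = (np.asarray(raw_inputs, dtype=float) - input_means) / input_sds
    z_forecast = float(z_inputs @ np.asarray(betas, dtype=float))
    return output_mean + z_forecast * output_sd

# Applicant data as listed in the question.
applicant = [42.00, 12.44, 0.9, 121_400, -34_228, -23_411]

# PLACEHOLDERS ONLY: the real means and standard deviations are in cells
# C207:H207 and C209:H209 of the workbook; the real betas come from LINEST.
placeholder_means = np.array([40.0, 10.0, 5.0, 60_000.0, -10_000.0, -10_000.0])
placeholder_sds = np.array([10.0, 5.0, 5.0, 30_000.0, 10_000.0, 10_000.0])
placeholder_betas = np.array([0.0, 0.1, 0.0, 0.5, 0.0, 0.0])

demo_estimate = point_estimate_dollars(
    applicant, placeholder_means, placeholder_sds, placeholder_betas,
    output_mean=1905.51, output_sd=5755.91,
)
print(f"${demo_estimate:,.2f}")
```

A quick sanity check on any implementation: an applicant whose inputs all equal the input means should get exactly the mean profitability as the forecast.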

14. The following data are known about the sample applicant:

Age: 42.00

Years at Employer: 12.44

Years at Address: 0.9

Income: $121,400

CC debt: -34,228

Auto debt: -23,411

To convert the above inputs to standardized form, locate the Training Set Spreadsheet (first bottom tab) in the Data for Final Project Workbook.

Use those means [Cells C207:H207] and standard deviations [Cells C209:H209].

Use the Training Set profitability mean [$1,905.51] and standard deviation [$5,755.91] from the Profit and Loss (last tab on bottom) Spreadsheet.

Use the Test Set standard deviation of error on standardized outputs of .6750.

Question: With 50% confidence, what is the range of profitability?

Range from $13,304.16 to $8,063.06.
Range from $12,962.61 to $10,683.61
Range from $11,823.28 to $9,543.94
Range from $10,683.61 to -$2,278.99

15. The following data are known about the sample applicant:

Age: 42.00

Years at Employer: 12.44

Years at Address: 0.9

Income: $121,400

CC debt: -34,228

Auto debt: -23,411

To convert the above inputs to standardized form, locate the Training Set Spreadsheet (bottom tab) in the Data for Final Project Workbook.

Use those means [Cells C207:H207] and standard deviations [Cells C209:H209].

Use the Training Set profitability mean [$1,905.51] and standard deviation [$5,755.91] from the Profit and Loss (bottom tab) Spreadsheet.

Use the Test Set standard deviation of error on standardized outputs of .6750.

Question: With 99% confidence, what is the range of profitability?

Range from $10,683.61 to -$8,704.31
Range from $19,388.27 to $10,683.61
Range from $16,388.27 to -$7,704.31
Range from $20,691.32 to $675.90.

16. Comparing Test Set and Training Set Performance

Question: Between the Training Set and the Test Set, the dollar value of the standard deviation of model error…

Increased by more than 50%, which leads to the conclusion of model over-fitting.
Increased by more than 25%, which suggests possible model over-fitting.
Decreased by about 15%, which suggests a very strong model on Test Set data.
Increased by less than 20%, which suggests minimal model over-fitting.
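
The comparison in this question comes down to a percentage change from the Training Set error to the Test Set error. A minimal sketch, using made-up dollar values rather than the quiz figures:

```python
def percent_change(training_value, test_value):
    """Percentage change going from the Training Set value to the Test Set value."""
    return (test_value - training_value) / training_value * 100.0

# Hypothetical error sds in dollars, for illustration only.
demo_change = percent_change(training_value=100.0, test_value=115.0)
print(f"{demo_change:.1f}%")
```

A positive result means the error grew on unseen data. The larger the increase, the stronger the hint of over-fitting.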
