DSCI 5180 (Spring 2020) Project Instruction and Guidelines

Data Analytics is a subject best appreciated when applied to a dataset you are familiar with. The aim of this project is to achieve that. Do not view this project as a hurdle in the course, but rather as a bridge connecting the topics you learnt to your work or subject domain.

There are five main modules in this course:

Module 1 : Normal Distribution (Percentile, distribution of means, and chance of occurrence if we assume normal distribution)

Module 2 : Confidence Interval Estimation (Including Sample Size determination)

Module 3 : Inferences from data (Hypothesis testing, i.e., checking whether a claim made about the data holds. In this module, we dealt with only one sample)

Module 4 : More Inferences from data (Multiple samples)

Module 5 : Regression analysis (Both simple and multiple, apart from basic ANOVA)

Objective : The purpose of the project is for you to apply what you learnt from at least 4 modules on your dataset and make some inferences or estimations.

 

 

Data source:

 

There are 3 options, you can choose one of them (there are no restrictions on that)

 

  1. Bring your own data from work (you can remove any private or confidential information, for example: if you are bringing any sales or cost data of an item/product or service – the name can be masked)

 

  2. Use data from your previous work or company you have access to (again you can remove any private/confidential information)

  3. Use data from the public domain. In today’s world, there is no dearth of structured data. Here are some places where you can get data from:

  • Amazing visualization or graphics

 

  • But remember, we need the data to do analysis, if you look at the bottom of any figure – Google would provide the source name, and you can retrieve data from there.

 

  • Any sports data (from the appropriate website; getting several years of data in a structured format might be a challenge, but with a few minutes to an hour of effort you can do it)

 

  • For example – Cricket data could be obtained from

 

(http://stats.espncricinfo.com/ci/engine/stats/index.html)

 

  • Or any data source you have access to

 

 

 

Grading Rubric

 

Total Points : 60 points

 

 

 

Midterm report : 20 points

 

(Due date : Feb 20th, 11:59pm)

 

No more than 1.5 to 2 pages. You should describe your source of data (including the data fields you have) and what you want to accomplish based on the topics you learnt. You can state the research hypotheses you plan to check, the confidence intervals you plan to estimate, or any relationship between variables that you think is important and plan to test.

 

I will provide feedback to each of you within 5 days (if you submit early, you get your feedback early). If I feel any change is needed, I will indicate that.

 

How the 20 points are given:

 

Your Data : 10 points

 

Your plan of action : 10 points

 

 

 

Final report : 40 points

 

(Due date: March 6th, 11:59 pm)

 

Main report : No more than 1.5 to 2 pages.

 

Present the findings using the skillset acquired (topics covered) in class.

 

Also include the dataset with the analysis (could be Excel or any statistical package). You should provide details of the analysis in an Appendix.

 

How the 40 points are given: 10 for each Module you choose to apply. (For example, if you choose regression to test an association or predict an outcome, you get 10 points for that analysis.)

 

SAMPLES (These are some examples, for each sample – I am showing you how we could use lessons learnt from one or two modules)

 

SAMPLE 1

 

Currentprices.com keeps a record of the sales prices of gasoline ($/gallon, at pump) at different retailing pumps/locations. The data cover regular unleaded gasoline, as recorded at 37 different pumps at 4 different locations, viz., Allen, Blaze, Corlis, and Dustin. The data is presented in the spreadsheet entitled Sample1.xls.

 

Applying lessons learnt from Module 2:

 

You can construct 95% confidence intervals for the mean price of regular gasoline for each of the four locations

 

Applying lessons learnt from Module 4:

 

You can use the ANOVA procedure to test whether there is any difference between the true mean prices at the four locations.
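As an illustration of the Module 4 idea, here is a minimal sketch of the one-way ANOVA F statistic computed by hand. The prices below are made-up placeholders, not the contents of Sample1.xls, and the function name one_way_anova_f is only illustrative. In practice Excel or a statistical package would give the same statistic along with a p-value.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k = len(groups)          # number of groups (locations)
    n = len(all_values)      # total number of observations (pumps)

    # Between-group sum of squares: how far each group mean is from the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of observations around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical $/gallon prices for the four locations (placeholders only)
prices = {
    "Allen":  [2.39, 2.41, 2.37, 2.44, 2.40],
    "Blaze":  [2.45, 2.48, 2.43, 2.47],
    "Corlis": [2.36, 2.35, 2.39, 2.38, 2.34],
    "Dustin": [2.42, 2.40, 2.46, 2.44],
}
f_stat, df_b, df_w = one_way_anova_f(list(prices.values()))
print(f"F = {f_stat:.2f} with ({df_b}, {df_w}) degrees of freedom")
```

The F statistic is then compared with the critical value from an F table at the chosen α. With the actual data (37 pumps, 4 locations) the degrees of freedom would be (3, 33).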

 

SAMPLE 2

 

Corporations with international operations need to assess the risks associated with setting up and maintaining operations in different regions of the world. Assessing these risks includes considering issues such as political and economic stability. One indicator of the healthcare and quality of life in a country or region that is considered correlated with the risk and stability in the region is the child mortality rate. As a result, the healthcare and quality of care, as measured by the child mortality rate in a region, can impact the type and amount of investment in a region and in countries within that region. The child mortality rate is the number of children 5 and under that die per thousand people in the population. The data, compiled by the Inter-agency Group for Child Mortality Estimation (IGME), were obtained from www.childmortality.org.

 

The Excel file for this assignment has labels for the year, country name, and continent group.

 

NAME OF FIELD        DESCRIPTION
Country              Name of the country
Continent / Region   Countries were categorized by region into 9 groups. The coding scheme is
                     1 = South-East Asia, 2 = South Asia, 3 = Western Europe, 4 = Eastern Europe,
                     5 = Africa, 6 = South America and Islands, 7 = Oceania with Australia and
                     New Zealand, 8 = Middle East, 9 = North America
CMR1998              Child mortality rate for 1998
CMR2008              Child mortality rate for 2008

 

Applying lessons learnt from Module 4:

 

We will test the following hypotheses:

 

  • the mean child mortality rate for countries in Africa is more than the mean child mortality rate for countries in South Asia (α = .05),

 

  • the mean child mortality rate for countries in Eastern Europe is more than the mean child mortality rate for countries in the Middle East (α = .01),

 

  • the mean child mortality rate for countries in South-East Asia exceeds the mean child mortality rate for countries in Western Europe by more than 10 per thousand (α = .05).
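One way such one-sided, two-sample tests could be set up is sketched below. This uses the unequal-variance (Welch) form of the t statistic, which is an assumption on my part; the course may instead use the pooled-variance version. The function name welch_t, the parameter d0 (the hypothesized difference, 0 for the first two hypotheses and 10 for the third), and the sample values are all illustrative placeholders, not IGME data.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample1, sample2, d0=0.0):
    """t statistic and approximate df for H0: mu1 - mu2 <= d0 vs H1: mu1 - mu2 > d0."""
    n1, n2 = len(sample1), len(sample2)
    v1 = variance(sample1) / n1   # squared standard error contributed by sample 1
    v2 = variance(sample2) / n2   # squared standard error contributed by sample 2
    t_stat = (mean(sample1) - mean(sample2) - d0) / sqrt(v1 + v2)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t_stat, df

# Hypothetical child mortality rates per thousand (placeholders only)
region_a = [120.0, 95.0, 143.0, 88.0, 110.0, 132.0]
region_b = [60.0, 72.0, 55.0, 81.0, 66.0]

t_stat, df = welch_t(region_a, region_b)            # "more than" test
t_stat10, df10 = welch_t(region_a, region_b, 10.0)  # "more than by 10" test
print(f"t = {t_stat:.2f} (df ~ {df:.1f}); with d0 = 10: t = {t_stat10:.2f}")
```

Reject H0 when the t statistic exceeds the upper-tail critical value from a t table at the stated α and the computed degrees of freedom.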

 

 

SAMPLE 3

 

Food Lion Inc. would like to investigate the feasibility and future prospects of setting up stores in Denton County. Food Lion has provided you with a sample from a database of household financial variables. The sample contains 100 records. The various fields in the sample are:

 

Income1 : Annual income of head of household or primary wage earner

 

Income2 : Annual income of secondary wage earner

 

Famlsize : Size of family (number of people permanently residing in the household)

 

Ownorent : 1 if the home is owned; 0 if it is rented

 

Autodebt : Automobile related debt pending for wage earners in the household

 

Hpayrent : Household mortgage payment or rent per month

 

Groc : Monthly expenditure on groceries

 

Loc : 1 = East Denton (E); 0 = West Denton (W); -1 = North Denton (N); 2 = South Denton (S)

 

Applying lessons learnt from Module 5:

 

We will use regression techniques to predict the monthly grocery expenditure of families who either rent or own their homes in Denton.
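A rough sketch of what such a regression could look like follows, fitting Groc on one income variable and the Ownorent indicator by ordinary least squares. The choice of predictors, the helper name ols_fit, and the six records are assumptions for illustration only; they are not taken from the Food Lion sample. The normal equations are solved directly so the example needs nothing beyond the standard library, whereas Excel or a statistical package would normally do this step.

```python
def ols_fit(rows, y):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    X = [[1.0] + [float(v) for v in r] for r in rows]   # prepend intercept column
    p = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [
        [sum(xi[j] * xi[k] for xi in X) for k in range(p)]
        + [sum(xi[j] * yi for xi, yi in zip(X, y))]
        for j in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(p):
            if r != c:
                factor = A[r][c] / A[c][c]
                A[r] = [a - factor * b for a, b in zip(A[r], A[c])]
    return [A[j][p] / A[j][j] for j in range(p)]

def predict(coefs, row):
    """Predicted response for one record of predictor values."""
    return coefs[0] + sum(b * v for b, v in zip(coefs[1:], row))

# Hypothetical records: (Income1 in $1000s, Ownorent) and monthly Groc in $
records = [(40, 1), (55, 0), (62, 1), (30, 0), (75, 1), (48, 0)]
groc = [410.0, 405.0, 520.0, 330.0, 545.0, 400.0]

coefs = ols_fit(records, groc)
print("intercept, income slope, owner effect:", [round(c, 2) for c in coefs])
print("predicted Groc for a renter earning $50k:", round(predict(coefs, (50, 0)), 2))
```

Because Ownorent is coded 0/1, its coefficient is read as the estimated difference in monthly grocery spending between owners and renters at the same income.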

 

 

 

SAMPLE 4

 

Alpha Tire Company wishes to offer a warranty for their tires. To assess the tires’ durability, a sample of 41 newly manufactured tires is selected at random. The mileage at which each tire will fail needs to be determined. Rather than driving the tires around for months until they fail naturally (field test), the tires are installed on a special machine that simulates wear and tear (lab test). The machine operates at high speed and high temperature, and applies high pressure on the tires against a rough surface. As a result, the tires blow up in just a few hours. A conversion formula is then used to compute each tire’s simulated mileage. At this time Alpha Tire believes the mean mileage to failure is 40,000 miles (μ = 40,000), with a standard deviation of 3,700 miles (σ = 3,700).

 

 

Applying lessons learnt from Modules 1 and 2:

 

  • Assess the feasibility of offering a 35,000-mile warranty on their tires, i.e., what percentage of tires would fail before 35,000 miles?

 

  • Assume that tire mileage is normally distributed and that neither the mean number of miles to failure nor σ is known. Using the sample of 41 tires to compute the sample mean (X Bar) and the sample standard deviation (s), we will estimate the 95% confidence interval for the mean number of miles to failure.
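Both calculations can be sketched in a few lines. The first function uses the stated μ = 40,000 and σ = 3,700 under the normality assumption. The second builds a t-based interval from a sample, with the critical value passed in from a t table (2.021 for 95% confidence and 40 degrees of freedom, as with 41 tires). The function names and the short sample below are illustrative placeholders, not the actual lab-test data.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def fraction_failing_before(miles, mu=40_000, sigma=3_700):
    """P(X < miles) when mileage to failure is Normal(mu, sigma)."""
    return NormalDist(mu, sigma).cdf(miles)

def t_interval(sample, t_crit):
    """Confidence interval for the mean: x_bar +/- t_crit * s / sqrt(n)."""
    n = len(sample)
    x_bar = mean(sample)
    half_width = t_crit * stdev(sample) / sqrt(n)
    return x_bar - half_width, x_bar + half_width

# Module 1: share of tires expected to fail before the 35,000-mile warranty
print(f"P(failure before 35,000 miles) = {fraction_failing_before(35_000):.4f}")

# Module 2: hypothetical simulated mileages (placeholders; the real sample has 41 tires,
# for which the t-table value with 40 degrees of freedom is 2.021)
sample = [38_200, 41_500, 36_900, 43_100, 39_800, 40_700, 37_400]
low, high = t_interval(sample, t_crit=2.447)  # 2.447 = t-table value for 95%, df = 6
print(f"95% CI for mean miles to failure: ({low:,.0f}, {high:,.0f})")
```

With the stated μ and σ, the z-score for 35,000 miles is about -1.35, so the normal model puts roughly 8.8% of tires failing before the warranty mileage.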