
Business Analytics

Content from my Business Analytics Fall 2018 class at CU Boulder

DataRobot project – Diabetes Data

Todo before assigning again


  1. Friday at midnight - Business understanding through Data Preparation
    • A Word document with business understanding and data understanding
    • Alteryx workflow (yxzp) showing all data preparation
  2. Next Tuesday Night - Data Understanding through Modeling
    • A Word document with your draft of the data descriptives plus the modeling writeup
    • A screenshot (or similar) from DataRobot showing me your models screen
  3. Thursday Night - Evaluation to End (final deliverable)
    • One word doc with the complete report – all sections combined
    • Alteryx workflow
    • Share your DataRobot project with [email protected] so I can take a peek if necessary. You can share projects with your teammates, and you should do that too. Please put your Canvas team number in your DataRobot project name! Don’t forget the managerial holdout predictions, as mentioned in the report.


For this project, you will analyze the following data from a diabetes study.

The paper, and the uci page, describe the data as follows:

The dataset represents 10 years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks. It includes over 50 features representing patient and hospital outcomes. Information was extracted from the database for encounters that satisfied the following criteria.

  1. It is an inpatient encounter (a hospital admission).
  2. It is a diabetic encounter, that is, one during which any kind of diabetes was entered to the system as a diagnosis.
  3. The length of stay was at least 1 day and at most 14 days.
  4. Laboratory tests were performed during the encounter.
  5. Medications were administered during the encounter.

The data contains such attributes as patient number, race, gender, age, admission type, time in hospital, medical specialty of admitting physician, number of lab test performed, HbA1c test result, diagnosis, number of medication, diabetic medications, number of outpatient, inpatient, and emergency visits in the year before the hospitalization, etc.

Problem statement: Management is concerned about the number of diabetic patients who are readmitted to the hospital within 30 days. If they are readmitted within 30 days, then this is a sign that proper care may not have been administered during the previous visit. If they are readmitted after 30 days, then management is less concerned about potential insufficient care. You have been asked by management to analyze the data on visits and develop a model that can predict whether a patient will be readmitted to the hospital.
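The problem statement above implies a binary target: readmitted within 30 days versus everything else. As a minimal sketch of that framing, the snippet below collapses a raw outcome code into a 0/1 flag. The column name `readmitted`, the codes `<30`, `>30`, and `NO`, and the new column name `readmit_30` are assumptions for illustration; check them against your copy of the data.

```python
# Assumed raw outcome code for "readmitted within 30 days"; verify against your file.
WITHIN_30_DAYS = "<30"


def to_binary_target(readmitted_code: str) -> int:
    """Return 1 if the encounter was followed by a readmission within 30 days, else 0."""
    return 1 if readmitted_code.strip() == WITHIN_30_DAYS else 0


# Hypothetical toy rows, not real study data.
rows = [
    {"encounter_id": 1, "readmitted": "<30"},
    {"encounter_id": 2, "readmitted": ">30"},
    {"encounter_id": 3, "readmitted": "NO"},
]

for row in rows:
    # "readmit_30" is a made-up name for the engineered target column.
    row["readmit_30"] = to_binary_target(row["readmitted"])

targets = [row["readmit_30"] for row in rows]
print(targets)
```

Under this framing, readmissions after 30 days are grouped with no readmission, which matches management's stated lower concern about them.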

You will write a report following the CRISP-DM structure. Remember that your audience is management of the hospital and every figure/table included in the report needs to be accompanied by an explanation of why it is included/significant to your findings. Specifically, your report should have the following sections:

Business Understanding- 5%

In this section, describe the purpose of the project. Articulating the purpose will help you in later stages of the CRISP-DM process. Please do not copy-paste what I wrote above.

Data Understanding- 5%

In this section, describe the original data source before any data preparation, e.g., the number of rows, what each row represents, and the data source (not just what website it came from, but where the actual data was collected from and what it looks like). Please do not simply copy-paste the above.

Data Preparation- 10%

In this section, describe the feature engineering that you performed, with enough detail that someone else could reproduce your work. Your report may be handed to the engineering team, and they may need to be able to reproduce what you did. This section should be a written explanation of how you altered your data. The actual files and outputs created belong in the appendix (explained further below).

Actually perform the following feature engineering and data cleaning. The following list includes quotes extracted from the research paper.

Data Understanding pt.2- 10%

In this section, provide summary statistics for each of your features, for each of the values in the table shown below. Also include the sample size after the data preparation above. You should try to obtain these statistics using DataRobot. Notice in the table below that additional “binning” was undertaken on some features, including Discharge Disposition and Admission Source.

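| Variable | Number of encounters | % of population | Readmitted: number of encounters | Readmitted: % in group |
|---|---|---|---|---|
| Not measured | | | | |
| High, changed | | | | |
| High, not changed | | | | |
| African American | | | | |
| Medical Specialty | | | | |
| General Practice | | | | |
| Internal Medicine | | | | |
| Diagnosis (primary) | | | | |
| Numeric | Mean | Median | Std. Dev | |
| Time in hospital | | | | |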
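If you want to sanity-check the numbers you read off DataRobot, the table's columns can be computed by hand. The sketch below is one way to do it with the Python standard library. The field names (`race`, `time_in_hospital`, `readmit_30`) and the toy records are hypothetical, and the standard deviation here is the sample standard deviation, which may differ slightly from what your tool reports.

```python
import statistics
from collections import Counter

# Hypothetical toy records, not real study data.
encounters = [
    {"race": "Caucasian", "time_in_hospital": 3, "readmit_30": 0},
    {"race": "AfricanAmerican", "time_in_hospital": 5, "readmit_30": 1},
    {"race": "Caucasian", "time_in_hospital": 2, "readmit_30": 1},
    {"race": "Caucasian", "time_in_hospital": 7, "readmit_30": 0},
]


def categorical_summary(records, feature, target="readmit_30"):
    """Per level: encounter count, % of population, readmitted count, % readmitted in group."""
    total = len(records)
    counts = Counter(r[feature] for r in records)
    readmitted = Counter(r[feature] for r in records if r[target] == 1)
    return {
        level: {
            "n": n,
            "pct_of_population": 100 * n / total,
            "n_readmitted": readmitted[level],
            "pct_in_group": 100 * readmitted[level] / n,
        }
        for level, n in counts.items()
    }


def numeric_summary(records, feature):
    """Mean, median, and sample standard deviation of a numeric feature."""
    values = [r[feature] for r in records]
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std_dev": statistics.stdev(values),
    }


race_summary = categorical_summary(encounters, "race")
stay_summary = numeric_summary(encounters, "time_in_hospital")
print(race_summary)
print(stay_summary)
```

Note the two different denominators: "% of population" divides by all encounters, while "% in group" divides by the encounters in that level only.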

Modeling- 30%

In this section, give a summary of the tool that you used for modeling – i.e., briefly describe DataRobot, including its mission. You should use three different “feature lists”. Document which features you put into each list!!!!

Using DataRobot, induce some models on your data. For each feature list perform:

Combining feature lists with modeling approaches, you can do the following…

| | Logistic Regression | Best Autopilot Model |
|---|---|---|
| Feature list 1- Simple baseline | Performance of Model 1 | Performance of Model 2 |
| Feature list 2- All features | Performance of Model 3 | Performance of Model 4 |
| Feature list 3- Most informative from all | Performance of Model 5 | Performance of Model 6 |

Note: You should not be evaluating metrics in this table! Describe briefly how logistic regression works, and also how your deploy-recommended DataRobot model works. DataRobot includes some very complicated-and-impressive-sounding models. However, with a bit of googling, you should be able to grasp how they work on a high level – enough to be able to describe to upper management how they work, anyway. In-depth knowledge of their workings is not necessary, but do your best to understand and describe them.
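To help with the "how logistic regression works" description, here is a minimal textbook sketch: a weighted sum of the features is passed through a sigmoid to get a probability, and the weights are adjusted by gradient descent to reduce prediction error. This is not DataRobot's implementation (which may add regularization and preprocessing); the single feature, the toy data, and the learning rate and epoch count are all assumptions for illustration.

```python
import math


def sigmoid(z):
    """Squash any real number into the (0, 1) range so it can be read as a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    """Probability of the positive class: sigmoid of the weighted sum of features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def fit(X, y, learning_rate=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n_features = len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, label in zip(X, y):
            error = predict_proba(weights, bias, features) - label
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(X)
    return weights, bias


# Hypothetical toy data: one feature (e.g. prior inpatient visits), label 1 = readmitted.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0, 0, 0, 1, 1, 1]

weights, bias = fit(X, y)
p_low = predict_proba(weights, bias, [0.0])
p_high = predict_proba(weights, bias, [5.0])
print(weights, bias, p_low, p_high)
```

For management, the useful takeaway is that each weight says how strongly, and in which direction, a feature moves the predicted probability of readmission.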

Evaluation- 30%

In this section, describe the DataRobot data partitioning and validation procedure that your models were trained on. Indicate whether you used k-fold cross-validation or TVH (training, validation, holdout), and explain to management how this lets you assess how well your models generalize to unseen data.
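The difference between the two partitioning schemes can be sketched in a few lines. In TVH, each row lands in exactly one of three partitions. In k-fold cross-validation, every row takes one turn in the validation fold while the model trains on the remaining folds. The fractions, fold count, and seed below are assumptions chosen for illustration, not DataRobot's settings; report whatever your project actually used.

```python
import random


def shuffled_indices(n_rows, seed=0):
    """Row indices in a reproducible random order."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return indices


def tvh_split(n_rows, validation_frac, holdout_frac, seed=0):
    """Split row indices into training, validation, and holdout partitions."""
    indices = shuffled_indices(n_rows, seed)
    n_holdout = int(n_rows * holdout_frac)
    n_validation = int(n_rows * validation_frac)
    holdout = indices[:n_holdout]
    validation = indices[n_holdout:n_holdout + n_validation]
    training = indices[n_holdout + n_validation:]
    return training, validation, holdout


def k_fold_splits(n_rows, k, seed=0):
    """Yield (training, validation) index lists, one pair per fold."""
    indices = shuffled_indices(n_rows, seed)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


training, validation, holdout = tvh_split(100, validation_frac=0.2, holdout_frac=0.2)
folds = list(k_fold_splits(100, k=5))
print(len(training), len(validation), len(holdout))
print([len(val) for _, val in folds])
```

In both schemes the model is scored on rows it never saw during training, which is what makes the score a reasonable estimate of performance on new patients.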

First, report the performance of all models:

Then, perform a deeper interpretation dive into the model that you recommend for deployment.

Note: DataRobot provides “feature importance” and “model x-rays” descriptions for each model. Read about these in the documentation and in your textbook, but the gist is this:

Evaluation – Prediction against Management Holdout

Management will provide you with a dataset (available here) containing a few rows for which they will ask you to predict readmission. They already know the answer, but they are using this data as a higher-level “holdout” of sorts to see how well your model can perform. To make these predictions, you should retrain the model that you are recommending for deployment on all available training data that you have uploaded to DataRobot. The assumption is that your model will be able to make better predictions against unseen data if it has been fed more training data.

Once you have trained your recommended-deploy model on the full training dataset, use that model to make predictions for the data that management provides. Submit a file with your predictions in your final deliverable. You only need to submit a csv with two columns:


High-level summary of your findings and recommendation.

Appendix- 10%

Include an appendix with enough in-depth detail about all of your feature engineering and data restructuring that a programming team could start from raw data in the format that you received it, and modify it sufficiently that it could be used for modeling / predictions: