Home is wherever I can find what I want.
Today's topic starts with a case study. The template below can help you frame complete data science problems in future applications.
"Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.
Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.
While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers (Kaggle users) to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful."
Many people cannot obtain loans because they have no credit history. This group often turns to riskier lenders, which may leave them worse off than before.
Home Credit wants to give this group a good lending experience by relaxing loan conditions, but even with relaxed conditions the company cannot accept serious bad debts (non-payment).
Predicting clients' repayment ability helps the company quickly decline applicants who are unlikely to repay, even when loan conditions are relaxed.
Credit Bureau access records
Home Credit internal records, such as past loan status, credit card status
Structured numerical and categorical data.
This is a classification problem: for each customer ID, predict whether the customer will repay the loan. The final output is a probability between 0 and 1, and predictions are evaluated with the Area Under the ROC Curve (AUROC).
An AUROC of 0.5 corresponds to random guessing. The closer it is to 1, the better the model's predictive power.
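To make the metric concrete, here is a minimal sketch of AUROC computed from its pairwise definition: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The function name `auroc`, the toy labels and scores, and the convention that label 1 is the positive class are all assumptions made for illustration, not part of the competition material.

```python
def auroc(y_true, y_score):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Assumption: labels are 0/1 and 1 is the positive class.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (made-up values): three of the four pairs are ranked correctly.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(labels, scores))  # 0.75

# A model that gives every customer the same score cannot rank anyone.
print(auroc(labels, [0.5, 0.5, 0.5, 0.5]))  # 0.5
```

The pairwise loop is quadratic and only meant to show what the number means. On real data a library routine such as scikit-learn's `roc_auc_score` would normally be used instead.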
How much data do you have? How many sources did you get your data from? What format is each source in? How are the data sets related to one another? For tables, what does each field and each row mean?
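Several of these questions can be answered mechanically once the files are loaded. The sketch below uses only the Python standard library and two tiny inline tables. The table contents, the column names (`customer_id`, `loan_id`, and so on) and the helper names are hypothetical stand-ins, not the actual Home Credit schema.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-ins for two source files; real data would be read from disk.
APPLICATIONS_CSV = """customer_id,income,contract_type
1,50000,cash
2,32000,revolving
3,41000,cash
"""
PAST_LOANS_CSV = """loan_id,customer_id,status
10,1,closed
11,1,active
12,3,closed
"""


def read_rows(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def describe(name, rows):
    """How much data is there, and which fields does each row carry?"""
    columns = list(rows[0].keys()) if rows else []
    return {"table": name, "rows": len(rows), "columns": columns}


def relationship(parent_rows, child_rows, key):
    """Summarize how child rows attach to parent rows through a shared key."""
    parent_keys = Counter(r[key] for r in parent_rows)
    child_keys = Counter(r[key] for r in child_rows)
    return {
        "key_unique_in_parent": all(c == 1 for c in parent_keys.values()),
        "max_children_per_parent": max(child_keys.values(), default=0),
        "parents_without_children": sorted(set(parent_keys) - set(child_keys)),
        "orphan_children": sorted(set(child_keys) - set(parent_keys)),
    }


applications = read_rows(APPLICATIONS_CSV)
past_loans = read_rows(PAST_LOANS_CSV)

print(describe("applications", applications))
print(describe("past_loans", past_loans))
print(relationship(applications, past_loans, "customer_id"))
```

In this toy data each row of the first table is one customer and each row of the second is one past loan, so a customer can have zero, one, or several past loans. Knowing that relationship up front tells you whether the second table must be aggregated before it can be joined to the first.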
In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods. Preliminary EDA serves three main purposes:
Understand the data: learn what information, structure, and characteristics it contains.
Find outliers or abnormal values: check whether the data contains errors.
Analyze the relationships between variables: find the important variables.
Through this process of observation, EDA lets us check whether the data conforms to the assumptions we made before the analysis.
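The three purposes above can be sketched with a few standard-library calls: summary statistics for understanding a column, Tukey's interquartile-range rule for flagging outliers, and the Pearson correlation for a first look at how two variables move together. The column names and values are invented for illustration, and the 1.5 multiplier is the conventional default for the rule rather than anything prescribed by the data.

```python
import statistics


def summarize(values):
    """Basic summary of one numeric column."""
    return {
        "count": len(values),
        "min": min(values),
        "median": statistics.median(values),
        "mean": statistics.fmean(values),
        "max": max(values),
    }


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric columns."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Hypothetical column in which one entry looks like a placeholder, not a real value.
years_employed = [1, 3, 4, 5, 6, 8, 1000]
print(summarize(years_employed))     # the mean is pulled far above the median
print(iqr_outliers(years_employed))  # [1000]

# Hypothetical pair of columns with a perfectly linear relationship.
income = [1, 2, 3, 4]
loan_amount = [2, 4, 6, 8]
print(pearson(income, loan_amount))  # approximately 1.0
```

A flagged value is a prompt to investigate, not proof of an error: it may be a data-entry mistake, a sentinel for "missing", or a genuine extreme case. A correlation close to zero likewise only rules out a linear relationship.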