Professional experience

I partner with clinical, legal, and operational teams to design data products that feel as polished as they are practical. Below is a snapshot of the roles and collaborations that continue to shape my approach.

Junior Data Scientist at the Medical Protection Society

In September 2021, I joined the Medical Protection Society (MPS) as a Junior Data Scientist. MPS is the world's leading medical indemnity provider, offering indemnity cover to members in the UK, Australia, South Africa, Singapore and Malaysia, among other countries. I am part of a data science team that sits within the underwriting, pricing and insurance department, where I have been responsible for deploying predictive models, presenting market analysis to executives, gaining web-scraping experience and developing my skills in natural language processing.

Churn models

I am responsible for the churn models for the UK Dental and Consultant markets, which use machine learning to predict whether a member will leave during their membership. I have added value to these models by incorporating survival analysis: we can now predict the likelihood that a member leaves within one, three or five years, rather than only whether they leave at all, as before. The models are built in Python using VS Code, with the data collected via SQL in Azure Data Studio.
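The production models themselves are proprietary, but the survival-analysis idea behind the one-, three- and five-year horizons can be sketched in a few lines. Below is a minimal, self-contained illustration (plain NumPy rather than a survival library, with hypothetical membership data): a Kaplan-Meier estimator that handles censored members and reads off the retention probability at a chosen horizon.

```python
import numpy as np

def kaplan_meier(durations, observed):
    """Kaplan-Meier survival curve.

    durations: time until churn or censoring (years)
    observed:  1 if the member actually churned, 0 if still active (censored)
    Returns the distinct event times and the survival estimate after each.
    """
    durations = np.asarray(durations, dtype=float)
    observed = np.asarray(observed, dtype=int)
    times = np.sort(np.unique(durations[observed == 1]))
    surv, s = [], 1.0
    for t in times:
        at_risk = np.sum(durations >= t)                 # members still active just before t
        events = np.sum((durations == t) & (observed == 1))
        s *= 1.0 - events / at_risk                      # product-limit step
        surv.append(s)
    return times, np.array(surv)

def survival_at(horizon, times, surv):
    """P(member still active at `horizon` years)."""
    past = times <= horizon
    return surv[past][-1] if past.any() else 1.0
```

With toy durations `[1, 2, 3, 4, 5]` and churn flags `[1, 0, 1, 1, 0]`, `survival_at(1, ...)` gives 0.8, and the five-year figure drops as each observed churn thins the risk set.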

Presenting work

During my first year at the company, I presented on a monthly basis to executive members of the company. As we eased back into the office, I led data science workshops to help less data-oriented members of the team grasp key concepts. Through this work I have won two company awards: each month, one member of the company receives an 'Ambitious values' award, and I was presented with this accolade (and a monetary prize) in March 2022 and September 2022.

Unstructured data

There is a wealth of long-form text data at MPS which, until recently, went unused. Since joining, I have been part of a team looking to extract value from this data. Through self-teaching and online resources, I have developed my NLP skills in Python and R, and I now own the Python code for the project. The data is pulled into Python from SQL, and I am tasked with improving the process at various stages. My largest contribution was using machine learning to select the most valuable document from a list of documents: the accuracy was originally 51%, and I increased it to 65%, meaning the quality of the data being fed into the project was significantly higher.
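The actual document-selection model is internal, but the flavour of the task can be sketched with a standard-library-only baseline: score each tokenised document by TF-IDF cosine similarity to a set of query terms and pick the best match. Everything here (the documents, the terms, the helper names) is hypothetical, and a real system would use a trained classifier rather than this heuristic.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn tokenised documents into TF-IDF weight dictionaries."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}     # smoothed IDF
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] / len(doc) * idf[t] for t in tf})
    return vectors

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def most_relevant(docs, query):
    """Index of the document most similar to the query terms."""
    vecs = tfidf_vectors(docs)
    q = Counter(query)
    qvec = {t: q[t] / len(query) for t in q}
    scores = [cosine(qvec, v) for v in vecs]
    return max(range(len(docs)), key=scores.__getitem__)
```

For example, given three tokenised documents, a query of `["clinical", "claim"]` picks out the one containing those terms.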

NHS Data Science Placement

This project made up one third of my MSc and involved working collaboratively with the NHS to provide insight into weaknesses in the A&E system. I was given the raw data set, which gave me great exposure to real-life data; previously, I had only worked with squeaky-clean data provided by the university. The goal of the project was to identify groups of patients at greater risk of lengthy delays in hospital, whether due to personal characteristics or hospital factors.

Data Visualisation

One of the most important parts of this project was to provide a clear structure of the A&E pathway. This involved producing flowcharts which showed the possible routes that a patient could take through the hospital.

Data Cleansing

As mentioned, the data were not cleansed before being passed on to me. The first task was straightforward: creating a target variable. This took two forms, reflecting two different approaches. The first response variable was time to disposal (a rather inhumane term referring to patients leaving A&E via discharge, transfer, death and so on); the second was a binary response stating whether a patient was disposed of within four hours of arrival.
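Deriving both targets amounts to a small timestamp calculation. As a sketch (the column names and four-hour threshold are from the project description; the function name is mine), both response variables can come from one pair of timestamps:

```python
from datetime import datetime

FOUR_HOURS = 4 * 60 * 60  # seconds

def make_targets(arrival, departure):
    """Derive both response variables from a pair of timestamps.

    Returns (time_to_disposal_minutes, breached), where `breached` is True
    when the patient was not disposed of within four hours of arrival.
    """
    seconds = (departure - arrival).total_seconds()
    return seconds / 60.0, seconds > FOUR_HOURS
```

A patient arriving at 09:00 and leaving at 12:30 gets a time to disposal of 210 minutes and no four-hour breach; leaving at 14:00 gives 300 minutes and a breach.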

The second, and more troubling, aspect of data cleansing was missing data. Only 57% of the rows were complete cases, meaning no variable was missing. Predictive mean matching was used to impute the missing values, and the results were checked thoroughly to make sure the imputation was appropriate.
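In practice this was done with an established imputation package, but the core of predictive mean matching is simple enough to sketch for a single variable: fit a regression on the complete cases, and fill each gap with an *observed* value borrowed from a donor whose predicted mean is close. The data and parameter choices below are illustrative only.

```python
import numpy as np

def pmm_impute(x, y, k=3, rng=None):
    """Predictive mean matching for one variable y with predictor x.

    Fit OLS on complete cases, predict y for every row, then fill each
    missing y with the observed value of a donor drawn from the k rows
    with the closest predicted means.
    """
    rng = np.random.default_rng(rng)
    x, y = np.asarray(x, float), np.asarray(y, float).copy()
    obs = ~np.isnan(y)
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X[obs], y[obs], rcond=None)   # OLS on complete cases
    yhat = X @ beta
    donors = np.flatnonzero(obs)
    for i in np.flatnonzero(~obs):
        nearest = donors[np.argsort(np.abs(yhat[donors] - yhat[i]))[:k]]
        y[i] = y[rng.choice(nearest)]                        # borrow an observed value
    return y
```

Because imputed values are always real observed values, PMM avoids the implausible fractional or out-of-range values that plain regression imputation can produce.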

Methodology

As mentioned, two similar but distinct problems were tackled. The first, and the more common data science problem, was the classifier: predicting whether or not a patient would leave A&E within four hours. Two classifiers were used, XGBoost and logistic regression. The focus was not so much on model performance, although this was important, as on feature importance. This worked differently for the two models: feature importance is a built-in output of XGBoost, whereas for logistic regression the coefficients were inspected.
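The logistic regression side of this can be illustrated without the original models or data. The sketch below fits a plain gradient-descent logistic regression to synthetic two-feature data (one informative feature, one noise feature, both hypothetical) and reads the fitted coefficients as a rough measure of influence, as described above; real work would of course use an established library.

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Plain gradient-descent logistic regression; returns (intercept, coefs)."""
    X = np.column_stack([np.ones(len(X)), X])    # intercept column
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))         # predicted probabilities
        w -= lr * X.T @ (p - y) / len(y)         # gradient of the log-loss
    return w[0], w[1:]

# Synthetic data: feature 0 drives the outcome, feature 1 is pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (1.5 * X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(float)
intercept, coefs = fit_logistic(X, y)
```

After fitting, the coefficient on the informative feature dwarfs the one on the noise feature, which is exactly the kind of signal that was inspected in the project.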

The second method was survival analysis. This focused on the time to event, a different branch of predictive modelling. It has one key difference from regression: the idea of censoring. The data required censoring, which meant the task lent itself to this type of analysis. A Cox proportional hazards model was used and, as with the classifiers, the effects of the variables were more important than the model itself.
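In practice a Cox model is fitted with a package such as R's survival or Python's lifelines; the sketch below is only an illustration of the mechanics for a single binary covariate, maximising the partial likelihood by gradient ascent on synthetic, censored data (all names and numbers here are mine, not the project's).

```python
import numpy as np

def cox_fit(times, events, x, lr=0.5, steps=500):
    """One-covariate Cox model fitted by gradient ascent on the partial likelihood."""
    order = np.argsort(times)
    times, events, x = times[order], events[order], x[order]
    beta = 0.0
    for _ in range(steps):
        grad = 0.0
        for i in np.flatnonzero(events):
            risk = np.arange(i, len(times))        # risk set: subjects with t_j >= t_i
            w = np.exp(beta * x[risk])
            grad += x[i] - np.sum(w * x[risk]) / np.sum(w)
        beta += lr * grad / events.sum()
    return beta

# Synthetic censored data: hazard increases with x, so beta should come out positive.
rng = np.random.default_rng(1)
n = 200
x = rng.integers(0, 2, n).astype(float)
t = rng.exponential(scale=np.exp(-x))              # event rate exp(x): true beta = 1
c = rng.exponential(scale=2.0, size=n)             # independent censoring times
times = np.minimum(t, c)
events = (t <= c).astype(int)
beta_hat = cox_fit(times, events, x)
```

Censored subjects never contribute an event term, yet they still appear in the risk sets of earlier events, which is exactly why censoring cannot simply be dropped as it would be in ordinary regression.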

Knowledge sharing

The final results from the model were presented during my dissertation viva, and also more generally to members of the NHS. The audience varied in its data science understanding, which required creativity on my part when presenting the work.

Education

In 2016, I began a BSc in Economics at Lancaster University. I graduated in 2019 and took a year out to save for an MSc. One year later, I began my MSc in Data Science back at Lancaster, graduating in 2021 with a distinction and a final average of 79%. The masters course taught a range of topics and offered three pathways; I took the business intelligence pathway. Some of the key modules I studied are described below.

  1. Statistical Learning - This module was the most important from a data science perspective. It was taught in my final term and covered advanced statistical methods. The final project involved creating classifiers using XGBoost and regularised logistic regression that took an image of a lung scan as input and predicted whether the scan was COVID-positive or negative.
  2. Statistical methods and modelling - This module supplied the framework for machine learning models; we studied regression in depth. The course was assessed through a group project in which we had to create a regression model predicting someone's weight from other factors.
  3. Intro to intelligent data analysis (data mining) - This was my first exposure to machine learning in Python. The module taught me the CRISP-DM process from start to finish, with a particular focus on data preparation, including Principal Component Analysis (PCA). The final assessment involved clustering weather data (k-means, mean shift and hierarchical) and then classifying new data into one of these clusters (k-nearest neighbour and Bayes).
  4. Programming for data scientists - This module was a personal favourite of mine, giving me the opportunity to upgrade my Python and R skills. Every week I was set a programming challenge that included tasks like recursion, developing a decision tree which maximises information gain, and list manipulation (pop etc.).
  5. Forecasting - Another favourite module of mine, this focused on business forecasting in R. Two methods were covered, (S)ARIMA and exponential smoothing, and I was tasked with forecasting withdrawals from cashpoints around the UK.
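Of the two methods from the forecasting module, simple exponential smoothing is compact enough to sketch. The coursework was in R, but for consistency with the rest of this page here is the same idea in Python, on made-up numbers: each new level is a weighted blend of the latest observation and the previous level, and the final level serves as a flat forecast for the next period.

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing.

    alpha near 1 reacts quickly to new observations; alpha near 0 smooths
    heavily. Returns the final smoothed level, i.e. the one-step forecast.
    """
    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level
```

For instance, a flat series forecasts itself regardless of alpha, while `alpha = 1` simply repeats the last observation, the naive forecast.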