The raw data consisted of multiple files in different formats, each containing millions of patient records spanning several years. The data combined claims (procedural and drug) and clinical information collected from multiple internal systems. It contained duplicate records as well as incomplete records with missing fields, and several date fields held entries in multiple formats, which further complicated processing. In the client's existing process, built on traditional RDBMS databases, processing a new file and deduplicating it against all existing data took weeks. Data loading was also onerous because of sporadic malformed records. The client had run a proof of concept using Python for data cleaning and storage, which reduced processing time significantly, but still not enough to satisfy the business.
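As an illustration of the kind of cleaning involved, the sketch below normalizes a date field that arrives in several formats, mapping unparseable values to None so they can be routed to the incomplete-record handling step. The format list and function name are illustrative assumptions, not the client's actual code.

```python
# Hypothetical helper for normalizing mixed-format date fields.
from datetime import datetime
from typing import Optional

# Assumed set of formats seen in the raw files (illustrative only).
CANDIDATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d-%b-%Y", "%Y%m%d"]

def normalize_date(raw: str) -> Optional[str]:
    """Return the date as ISO yyyy-mm-dd, or None if no known format matches."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None  # flagged for the incomplete/malformed-record handling step
```

In a distributed setting, a function like this would typically be applied as a UDF or expressed with Spark's built-in date parsing, so that all date columns converge on a single canonical type before deduplication.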
From weeks to less than an hour
We proposed a Spark-based system covering everything from reading the raw data files to pre-processing and summarizing all data for feeding into dashboards. Spark greatly simplified the data loading process: we used its CSV parser to auto-infer the schema and load without loss of data, then re-converted the small number of mis-typed fields to the correct data types. Spark's window functions were very useful for complex summarizations, including calculating medication adherence metrics and risk scores at the patient level. The end result surprised the client: data loading and summarization for a full year completed in less than an hour even on a moderate cluster configuration, where the process had previously taken weeks. The processed data was later used to run machine learning and deep learning algorithms.