This FinTech case study, which uses statistical measurement to assess natural language processing (NLP) performance, draws on Gravitate client work. Founder Qiuyan Xu, PhD, FRM, co-presented it at the JSM 2022 conference with Ying Li and Xiyu He.
Gravitate was encouraged by the positive feedback on these findings. Our vision is to make AI a reality everywhere we can, which is why we are sharing the work with a wider audience.
Machine learning (ML) and Artificial Intelligence (AI) technologies have fueled unprecedented levels of interest across the financial industry.
Natural language processing techniques in FinTech are applied in many use cases, such as credit scoring and fraud detection.
Deep learning models based on neural networks are now used in such applications, and different metrics can be used to track their performance.
See the Natural Language Toolkit (NLTK) and huggingface.co to dig into these terms further.
Challenges for NLP performance tracking
Here are five performance metrics typically used when developing a deep learning algorithm for binary classification in natural language processing.
Precision: of everything predicted positive, the share that is actually positive.
Recall: of everything that is actually positive, the share the model predicted as positive.
F1 Score: the harmonic mean of precision and recall, balancing the two.
False Positive: a case predicted to have the condition that actually does not have it.
False Negative: a case predicted not to have the condition that actually has it.
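As a minimal sketch of how these five quantities relate, the snippet below counts false positives and false negatives and derives precision, recall and F1 from them. The function name `binary_metrics` and the labels are made up for illustration; they are not from the case study.

```python
def binary_metrics(y_true, y_pred):
    """Compute confusion counts and derived metrics for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 1 = finance-related message, 0 = not finance-related.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Returning 0.0 when a denominator is zero is one convention among several; some libraries warn or return an undefined value instead.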
Deep learning algorithm performance varies widely and is often difficult to reproduce.
Variations can result from:
Different deep learning architectures, such as BERT and GPT.
The stochastic nature of the optimization process, together with the choice of hyperparameters for a given model.
Data sources, such as different sample sizes, class ratios and out-of-domain test set distributions.
FinTech NLP performance use case: verifying alternative data relevance
Large and small non-banking financial institutions (NBFIs) are developing their own proprietary credit scoring systems that include selected non-traditional, non-financial data, otherwise known as alternative data. Adding in alternative data gives them a unique view into portfolio investment risks and opportunities.
During a recent project that Gravitate helped develop, one such alternative data component required a machine learning system to identify whether an SMS message is finance-related, which in turn feeds a novel credit scoring algorithm. This sub-problem is a binary classification problem.
Managing sources of variation
Architecture and parameter changes contribute to model performance. In real applications, data size and data quality are often the larger influence on final results, and they cap the upside.
Most applications in deep learning require significant sample size. Different teams might have different sample sizes to train and validate their models even for similar applications.
They could also have different ratios for different classes, especially for imbalanced data applications such as fraud detection, and newer businesses might just not have enough historical data.
Even if teams can access similar training and validation data, when applications are deployed in the real world, algorithms can face “test” data that has very different distributions.
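One way to study the class-ratio effect described above is to downsample a labeled set to a chosen ratio before training. The sketch below is an illustration under assumptions: the helper `resample_to_ratio`, its `pos_per_neg` parameter and the synthetic messages are hypothetical, not the procedure used in the case study.

```python
import random


def resample_to_ratio(examples, pos_per_neg, seed=0):
    """Downsample so that positives : negatives is about pos_per_neg : 1.

    `examples` is a list of (text, label) pairs with label 1 or 0.
    """
    rng = random.Random(seed)
    pos = [e for e in examples if e[1] == 1]
    neg = [e for e in examples if e[1] == 0]
    # Largest subset sizes that respect the requested ratio.
    n_neg = min(len(neg), int(len(pos) / pos_per_neg))
    n_pos = min(len(pos), int(n_neg * pos_per_neg))
    sample = rng.sample(pos, n_pos) + rng.sample(neg, n_neg)
    rng.shuffle(sample)
    return sample


# Synthetic data: 100 positive and 900 negative messages.
data = [(f"msg {i}", 1) for i in range(100)]
data += [(f"msg {i}", 0) for i in range(100, 1000)]

balanced = resample_to_ratio(data, pos_per_neg=1.0)   # 1:1
skewed = resample_to_ratio(data, pos_per_neg=0.25)    # 1:4
print(len(balanced), len(skewed))
```

Fixing the seed keeps the resampled sets reproducible, so differences between runs can be attributed to the ratio rather than to the draw.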
Common factors to measure NLP for variation
We assessed three sources of variation: architecture, data and hyper parameters.
Architecture
BERT
DistilBERT
RoBERTa
Data
Training and validation sizes
Class ratios (class 1 to class 0)
Out-of-domain test variation
Hyper Parameter
Max length
Learning rate
Batch size
Weight decay
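The architecture and hyper parameter factors listed above can be laid out as an experiment grid, so that each run records exactly which settings produced its metrics. This is only a sketch: every value in `grid` is a hypothetical placeholder, since the settings actually used in the study are not given here.

```python
from itertools import product

# Hypothetical values for illustration only.
grid = {
    "architecture": ["BERT", "DistilBERT", "RoBERTa"],
    "max_length": [64, 128],
    "learning_rate": [2e-5, 5e-5],
    "batch_size": [16, 32],
    "weight_decay": [0.0, 0.01],
}

# One dict per combination of settings.
runs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(runs))  # 3 * 2 * 2 * 2 * 2 = 48 configurations
print(runs[0])
```

A full grid grows quickly, so in practice one factor is often varied at a time while the others are held at a baseline.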
So which common factor variations did we assess?
Train/Validation Data Variation
Class Ratio Variation
Out of Domain Test Variation
Hyper Parameter Variation
For example, with hyper parameter variation, most text data has a long tail of sentence lengths. Max length determines whether we truncate a sentence, capping it at a certain number of tokens. This parameter can vary. With millions of text messages, a shorter max length might help with algorithm efficiency, but might sacrifice some accuracy on longer text messages that exceed this cap.
Learning rate is another parameter that usually varies. It can significantly affect performance, although optimizers with adaptive learning rates are less sensitive to it.
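The max length trade-off can be made concrete by checking what share of messages fit under a given cap. In the sketch below, whitespace splitting stands in for the model's real subword tokenizer, and the messages and the helper names `truncate` and `coverage` are made up for illustration.

```python
def truncate(tokens, max_length):
    """Keep at most `max_length` tokens, dropping the tail."""
    return tokens[:max_length]


def coverage(token_lists, max_length):
    """Share of messages that fit within `max_length` without truncation."""
    fits = sum(1 for tokens in token_lists if len(tokens) <= max_length)
    return fits / len(token_lists)


# Made-up messages; a real pipeline would use the model's own tokenizer.
messages = [
    "Your account was credited with 500",
    "Lunch at noon?",
    "Loan repayment of 1200 is due on Friday please pay via the app to avoid late fees",
    "ok",
]
token_lists = [m.split() for m in messages]

for max_length in (4, 8, 32):
    print(max_length, coverage(token_lists, max_length))
```

Plotting coverage against max length on real data shows where the long tail begins, which helps pick a cap that truncates only a small share of messages.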
Larger batch sizes tend to lead to better performance, but consume more computing resources.
Weight decay is a regularization technique in deep learning. It works by adding a penalty term to the cost function of a neural network, which has the effect of shrinking the weights at each training update. In our case study, higher decay led to lower precision but higher recall.
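To illustrate the shrinking effect, here is a minimal sketch of one plain gradient descent step on a loss with an added L2 penalty. The function `sgd_step` and the numbers are illustrative assumptions, not the training setup from the case study.

```python
def sgd_step(weights, grads, learning_rate, weight_decay):
    """One SGD update on loss + (weight_decay / 2) * sum(w ** 2).

    The penalty adds weight_decay * w to each gradient, which pulls
    every weight toward zero.
    """
    return [w - learning_rate * (g + weight_decay * w)
            for w, g in zip(weights, grads)]


weights = [2.0, -1.0, 0.5]
zero_grads = [0.0, 0.0, 0.0]  # zero data gradient isolates the penalty's effect

for decay in (0.0, 0.01, 0.1):
    print(decay, sgd_step(weights, zero_grads, learning_rate=0.1, weight_decay=decay))

decayed = sgd_step(weights, zero_grads, learning_rate=0.1, weight_decay=0.1)
```

Optimizers such as AdamW apply the decay directly to the weights instead of through the loss; the sketch shows only the penalty-term form described in the text.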
How to measure and report NLP performance metrics
Given the significant differences in performance metrics, it is important to use the right metrics suite for the application, with ranges, and ideally confidence intervals, instead of just one or two numbers. It is also important to be transparent and disclose the algorithm setting and data aspects.
Report full sets of metrics, but know what is more important.
Measure with ranges and confidence intervals.
Ensure transparency and user education.
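One common way to attach a confidence interval to a metric is the percentile bootstrap: resample the test set with replacement, recompute the metric each time, and read off the percentiles. The sketch below assumes this technique for illustration; the source does not state which interval method was used, and the labels are synthetic.

```python
import random


def precision(y_true, y_pred):
    """Share of predicted positives that are actually positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if tp + fp else 0.0


def bootstrap_ci(y_true, y_pred, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `metric` on paired labels."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        # Resample example indices with replacement, keeping pairs aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Synthetic test set: 100 positives (85 caught), 100 negatives (10 false alarms).
y_true = [1] * 100 + [0] * 100
y_pred = [1] * 85 + [0] * 15 + [1] * 10 + [0] * 90

point = precision(y_true, y_pred)
lower, upper = bootstrap_ci(y_true, y_pred, precision)
print(f"precision = {point:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

The same `bootstrap_ci` call works for recall or F1 by passing a different metric function, which makes it easy to report the full suite with intervals.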
Next steps in assessing NLP performance
Data science grows through a commitment to continual learning. Dr. Xu and team reflected on the work during the workshop, inviting ideas and additional questions. Opportunities identified include applying a similar model to more difficult use cases, developing better labeling methods and improving the "bad case" analysis.
Data scientists and AI/ML developers interested in working on these models should reach out to Gravitate for potential collaboration opportunities.