Privacy-Preserving RAG System for Financial Contracts
On-premise AI Solution with Advanced Document Intelligence
Situation & Privacy Challenge
Financial institutions require AI solutions that balance performance with data privacy:
- Data Sensitivity: Financial contracts contain highly confidential information
- Performance Needs: 4+ hour manual review per complex contract
- Privacy Constraints: Prohibition of external API calls for sensitive data
- Accuracy Requirements: Need for >90% accuracy in legal document analysis
Task & Technical Objectives
Develop a fully on-premise RAG system meeting strict privacy requirements:
- Privacy-First Architecture: 100% on-premise deployment, no external APIs
- Open-Source LLM Integration: Implement Llama 3 for local inference
- Advanced Chunking: Semantic chunking preserving document structure
- Optimized Retrieval: Implement re-ranking for precision improvement
- Maximum Efficiency: Target 70%+ reduction in processing time
Action & Technical Innovations
🔒 Privacy-Preserving Architecture
- Full On-Premise Deployment: Complete control over data flow
- Llama 3 Integration: 8B parameter model running locally
- Air-Gapped Capability: Operates without internet connectivity
- Encrypted Storage: Vector database with at-rest encryption
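To make the encrypted-storage point concrete, here is a minimal sketch assuming FAISS as the local vector index and a Fernet symmetric key standing in for one fetched from the institution's key-management service; the dimension, file name, and placeholder vectors are illustrative.

```python
# Sketch: encrypting a FAISS vector index at rest with a symmetric key.
# Assumptions: faiss and cryptography are installed; in production the key
# would come from a KMS, not be generated inline as here.
import faiss
import numpy as np
from cryptography.fernet import Fernet

dim = 384                                     # illustrative embedding size
index = faiss.IndexFlatIP(dim)                # simple inner-product index
index.add(np.random.rand(100, dim).astype("float32"))  # placeholder vectors

key = Fernet.generate_key()                   # stand-in for a KMS-managed key
fernet = Fernet(key)

# Serialize the index to bytes, encrypt, and persist.
blob = faiss.serialize_index(index)           # numpy uint8 array
with open("contracts.index.enc", "wb") as f:
    f.write(fernet.encrypt(bytes(blob)))

# Decrypt and restore before serving queries.
with open("contracts.index.enc", "rb") as f:
    raw = fernet.decrypt(f.read())
index = faiss.deserialize_index(np.frombuffer(raw, dtype="uint8"))
```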
📄 Advanced Document Processing
- Semantic Chunking: Intelligent segmentation (a chunking sketch follows this list) preserving:
  - Legal document structure (sections, subsections)
  - Financial table integrity
  - Cross-references
  - Hierarchical relationships
- Context-Aware Processing:
  - Document type detection (NDA, MSA, SOW, etc.)
  - Party identification and role extraction
  - Jurisdiction and governing law detection
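A minimal sketch of the structure-aware chunking idea, assuming contracts arrive as plain text with numbered section headings; the heading regex, size cap, and paragraph fallback are illustrative stand-ins for the production algorithm.

```python
# Sketch: split contract text at section headings so no chunk straddles a
# legal section boundary. Heading pattern and size cap are illustrative.
import re

HEADING = re.compile(r"^(?:Section\s+\d+|\d+(?:\.\d+)*[.)]?)\s", re.MULTILINE)

def semantic_chunks(text: str, max_chars: int = 2000) -> list[str]:
    starts = [m.start() for m in HEADING.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)                   # keep any preamble text
    sections = [text[a:b].strip()
                for a, b in zip(starts, starts[1:] + [len(text)])]
    chunks: list[str] = []
    for sec in sections:
        if len(sec) <= max_chars:
            chunks.append(sec)                # whole section fits in one chunk
        else:
            # Oversized section: fall back to paragraph-level splits.
            chunks.extend(p.strip() for p in sec.split("\n\n") if p.strip())
    return [c for c in chunks if c]
```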
⚡ Optimized Retrieval Pipeline
- Multi-Stage Retrieval (see the retrieval sketch after this list):
  - Initial Retrieval: Dense embeddings with sentence transformers
  - Re-ranking: Cross-encoder models for precision improvement
  - Context Expansion: Adjacent chunk inclusion for coherence
- Performance Optimizations:
  - Batch processing for parallel inference
  - Model quantization for reduced memory footprint
  - Cache layer for frequent queries
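A minimal sketch of the two-stage retrieve-then-re-rank flow using the open-source sentence-transformers library; the specific model names and top-k values are illustrative defaults, not the deployed configuration.

```python
# Sketch: two-stage retrieval with sentence-transformers models that run
# fully locally. Model names and top-k values are illustrative defaults.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve(query: str, chunks: list[str], k_dense: int = 20, k_final: int = 5):
    # Stage 1: dense retrieval (corpus embeddings would be precomputed in practice).
    corpus_emb = bi_encoder.encode(chunks, convert_to_tensor=True)
    query_emb = bi_encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, corpus_emb, top_k=k_dense)[0]

    # Stage 2: cross-encoder re-ranking of the dense candidates for precision.
    candidates = [chunks[h["corpus_id"]] for h in hits]
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:k_final]]
```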
Key Technical Innovations:
- Developed document-structure-aware chunking algorithm
- Built multi-modal re-ranker combining semantic and lexical signals
- Designed privacy audit trail for compliance reporting
Results & Impact
- Unprecedented Efficiency: Contract review time reduced from 4 hours to 72 minutes (70% reduction)
- Enterprise-Grade Privacy: Full compliance with financial regulations and data sovereignty requirements
- Superior Accuracy: 90%+ accuracy achieved through re-ranking and domain adaptation
- Cost Optimization: Eliminated external API costs while maintaining performance
📈 Performance Benchmarks vs Alternatives
| Metric | Our Solution | GPT-4 API | Manual Review |
|---|---|---|---|
| Accuracy | 92% | 94% | 85% |
| Data Privacy | 100% | 0% | 100% |
| Cost per Document | $0 | $2.50 | $200 |
Key Achievements
Privacy Innovation
Delivered enterprise-grade privacy with 100% on-premise deployment
Performance Excellence
Achieved 70% time reduction while maintaining 90%+ accuracy
Technical Leadership
Successfully implemented Llama 3 with advanced RAG optimizations
Business Impact
Enabled secure AI adoption in highly regulated financial environment
System Architecture Overview
Document Ingestion (PDF/OCR, parsing) → Semantic Chunking (structure-aware) → Vector Storage (local embeddings) → Retrieval & Re-rank (multi-stage) → Llama 3 Generation (on-premise)
Intelligent Email Classification System
DBS Banking - Multi-label classification for customer service optimization
Situation & Challenge
DBS banking teams received hundreds of emails daily, with urgency varying by email type/label. The manual classification process was:
- Time-consuming and inconsistent across different labelers
- Prone to human error in identifying urgent vs non-urgent emails
- Unable to handle the volume efficiently (multiple labels per email)
Task & Responsibilities
My primary responsibilities included:
- Design and implementation of NLP preprocessing pipelines
- Fine-tuning of deep learning models for multi-label classification
- Comprehensive evaluation and analysis of model performance
- Data quality investigation and process improvement recommendations
Action & Approach
Initial results showed poor model performance (~70% F1). Through deep data analysis, I discovered a critical issue:
- Inconsistency Discovery: Comparing emails with identical labels revealed significant variations when labeled by different people
- Process Intervention: Requested unified labeling standards and attended labeling sessions to understand data better
- Improved Preprocessing: Enhanced data cleaning and feature engineering based on labeling insights
- Model Optimization: Implemented advanced NLP techniques and fine-tuned models with improved data
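A minimal sketch of the multi-label setup behind the model optimization step, assuming a Hugging Face encoder fine-tuned with one sigmoid output per label; the base model, label names, and 0.5 threshold are hypothetical placeholders (the real labels are internal).

```python
# Sketch: multi-label head on a pretrained encoder; one sigmoid per label.
# Base model, label names, and threshold are hypothetical placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["urgent", "complaint", "account_query", "fraud_alert"]  # hypothetical

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=len(LABELS),
    problem_type="multi_label_classification",  # BCE loss during fine-tuning
)

def predict_labels(email_text: str, threshold: float = 0.5) -> list[str]:
    inputs = tokenizer(email_text, truncation=True, return_tensors="pt")
    with torch.no_grad():
        probs = torch.sigmoid(model(**inputs).logits)[0]  # independent per label
    return [lab for lab, p in zip(LABELS, probs) if p >= threshold]
```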
Results & Impact
- Performance Leap: Improved F1 score from 70% to over 90%
- Process Standardization: Established consistent labeling protocols across teams
- Efficiency Gain: Automated classification enabled faster processing and prioritization of urgent emails
- Scalable Solution: System capable of handling growing email volumes
Key Achievements
Data Quality Insight
Identified critical labeling inconsistencies that were degrading model performance
Process Improvement
Implemented standardized labeling protocols that improved data quality
Performance Optimization
Achieved a >20-point F1 improvement in classification performance
Business Impact
Enabled faster email processing and better customer service response times
Automated Financial Table Extraction System
Financial Spreading Automation - PDF Report Processing
Situation & Challenge
Manual extraction of financial tables (balance sheets, income statements, cash flow) from annual PDF reports was a major bottleneck in financial spreading processes:
- Time-consuming manual process prone to human errors
- Existing table extraction solutions not adapted to financial document complexity
- No standardized templates across different companies
- Multi-page tables with complex structures
Task & Responsibilities
My mission was to design, implement, and evaluate a robust automated financial table extraction pipeline:
- Analyze and evaluate state-of-the-art table extraction technologies
- Define the overall solution architecture and pipeline
- Create and annotate a reference dataset for training and evaluation
- Develop post-processing algorithms for table correction and restructuring
- Conduct comprehensive system performance evaluation
Action & Approach
I developed an innovative hybrid architecture combining multiple approaches:
- State-of-the-art Analysis: Evaluated existing tools (Camelot, Tabula) and deep learning models (CascadeTabNet, TableNet)
- Hybrid Pipeline Design:
  - Discovery step using regex to identify pages containing target financial tables (see the sketch after this list)
  - YOLOv3 model fine-tuned on FinTabNet for precise table region detection
  - Custom post-processing module for cleaning extraction artifacts and reconstructing logical table structure
- Dataset Creation: Supervised the creation of a manually annotated dataset (291 tables from 100 annual reports)
- Custom Metrics: Developed the ExactMatchSim metric and a multi-criteria evaluation methodology
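A minimal sketch of the regex discovery step, assuming per-page text has already been extracted from the PDF; the keyword patterns are illustrative and far simpler than the production rules.

```python
# Sketch: regex discovery of pages likely to contain target financial tables.
# Keyword patterns are illustrative; the production rules were more elaborate.
import re

TABLE_PATTERNS = {
    "balance_sheet": re.compile(r"balance\s+sheet", re.IGNORECASE),
    "income_statement": re.compile(r"(income|profit\s+and\s+loss)\s+statement",
                                   re.IGNORECASE),
    "cash_flow": re.compile(r"cash\s+flow\s+statement", re.IGNORECASE),
}

def discover_pages(pages: list[str]) -> dict[str, list[int]]:
    """Map each target table type to the page numbers that mention it."""
    found: dict[str, list[int]] = {name: [] for name in TABLE_PATTERNS}
    for page_no, text in enumerate(pages, start=1):
        for name, pattern in TABLE_PATTERNS.items():
            if pattern.search(text):
                found[name].append(page_no)
    return found
```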
Results & Impact
- High Detection Rates: >95% detection for single-page tables, >96% overall document detection
- Excellent Accuracy: 94% column similarity, 92.1% row similarity, 95.7% number extraction accuracy
- Research Contribution: Created reference dataset and published results in scientific paper
- Business Value: Dramatically reduced manual extraction time and errors
Key Achievements
Innovative Architecture
Designed hybrid pipeline combining regex discovery, deep learning detection, and custom post-processing
Dataset Creation
Built comprehensive annotated dataset for financial table extraction
High Accuracy
Achieved >95% accuracy on critical financial data extraction
Research Publication
Published results validating innovative approach in scientific paper
Murex Trading System Reconciliation Clustering
Root Cause Analysis Automation
Situation & Challenge
Trading system reconciliation projects typically involved 20+ business analysts over 2 years. The challenge was to cluster mismatches with the same root cause:
- The existing solution's computational complexity prevented timely results
- Large dataset: 130,000+ transactions (65,000 mismatches, 48 features)
- Highly imbalanced data with repetitive root causes and duplicates
- Need for scalable solution to handle growing data volumes
Task & Responsibilities
My main task was to reduce solution complexity and achieve results in reasonable time:
- Data preprocessing and quality improvement
- Feature reduction and dataset optimization
- Clustering algorithm implementation and evaluation
- Performance comparison and impact analysis
- Development of reproducible methodology
Action & Approach
I implemented a comprehensive data reduction and clustering strategy:
- Data Preprocessing:
  - Defined unique transaction keys by concatenating Murex number and operation type
  - Removed rows with missing values, inconsistent duplicates, and transactions without root cause
- Data Reduction:
  - Duplicate removal (reduction from 65,273 to 19,095 observations)
  - Custom sampling with max_count threshold per cluster
  - Elimination of low-variance or single-value features
- Clustering Implementation (a DBSCAN sketch follows this list):
  - Implemented FuzzyART and DBSCAN with parameter tuning
  - Used metrics: Adjusted Rand Score, purity, detectability
  - Validated by projecting clusters onto the complete dataset
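A minimal sketch of the reduction-then-clustering flow for the DBSCAN branch, assuming the mismatches sit in a pandas DataFrame; the column names (`murex_no`, `op_type`, `root_cause`) and the DBSCAN parameters are illustrative.

```python
# Sketch: dedup + low-variance pruning, then DBSCAN scored with the
# Adjusted Rand Score against known root causes. Column names and the
# eps/min_samples values are illustrative; features assumed numeric.
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

def cluster_mismatches(df: pd.DataFrame, feature_cols: list[str]):
    # Unique transaction key: Murex number concatenated with operation type.
    df = df.assign(tx_key=df["murex_no"].astype(str) + "_" + df["op_type"])
    df = df.dropna(subset=feature_cols + ["root_cause"])
    df = df.drop_duplicates(subset=feature_cols)

    # Drop single-value / near-constant features.
    X = df[feature_cols]
    X = X.loc[:, X.nunique() > 1]

    labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(
        StandardScaler().fit_transform(X)
    )
    return labels, adjusted_rand_score(df["root_cause"], labels)
```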
Results & Impact
- Significant Data Reduction: Up to 96% feature reduction while preserving 97% of root causes
- Improved Performance: FuzzyART achieved 0.65-0.67 Adjusted Rand Score
- Method Validation: Demonstrated that removing duplicates and uninformative features doesn't harm clustering quality
- Reproducible Methodology: Validated approach on simulated data with various noise levels
- Scalable Solution: Enabled efficient and scalable root cause detection in industrial context
Key Achievements
Data Optimization
Reduced dataset by 96% while maintaining 97% of critical information
Algorithm Performance
Achieved high-quality clustering with 0.67 Adjusted Rand Score
Computational Efficiency
Enabled timely results through intelligent data reduction
Industrial Scalability
Developed methodology applicable to large-scale industrial data
Multimodal Tree Species Recognition System
LISTIC Lab, Annecy - Mobile AI Application for Botany
Situation & Challenge
Tree species recognition presents significant challenges due to:
- High diversity of tree species in nature
- Interspecies similarity and intra-species variability
- Frequent recognition confusions caused by similarities between species
- Need for offline mobile applications accessible to everyone
- Existing solutions achieving only 56% accuracy
Task & Objectives
The project had three main objectives:
- Intelligent Decision System: Develop a system emulating botanist expertise using belief functions theory to reduce confusion and improve accuracy
- Mobile Solution: Create a practical smartphone application working offline, adapted to memory and computation limits
- Accessibility: Make tree species recognition accessible and easy to use for everyone in nature
Action & Technical Approach
I developed an innovative two-step multimodal recognition approach:
- Step 1: Leaf Identification
  - Used to reduce problem dimensionality
  - Identifies subset of most probable species
  - Leverages leaf morphology and texture features
- Step 2: Bark Refinement (a Dempster-combination sketch follows this list)
  - Modified evidential k-Nearest Neighbors (EkNN) algorithm
  - Recognizes bark from first step output
  - Belief functions theory for reasoning with uncertainty
- Mobile Optimization
  - Designed for offline smartphone use
  - Optimized for memory and computation constraints
  - Lightweight model architecture
- Experimental Validation
  - Conducted experiments on real-world data
  - Compared against existing solutions
  - Validated accuracy improvements
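To make the belief-function step concrete, here is a minimal sketch of Dempster's rule of combination, the operation used to fuse evidence from the two modalities; the toy three-species frame and the two mass functions are illustrative, not actual EkNN outputs.

```python
# Sketch: Dempster's rule of combination over a toy frame of three species.
# m_leaf and m_bark are illustrative mass functions, not real EkNN output.
from itertools import product

def dempster_combine(m1: dict[frozenset, float], m2: dict[frozenset, float]):
    combined: dict[frozenset, float] = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb              # mass on contradictory pairs
    return {s: w / (1.0 - conflict) for s, w in combined.items()}

oak, beech, ash = "oak", "beech", "ash"
m_leaf = {frozenset({oak, beech}): 0.7,      # leaf narrows to two species
          frozenset({oak, beech, ash}): 0.3} # remaining mass on ignorance
m_bark = {frozenset({oak}): 0.6,
          frozenset({oak, beech, ash}): 0.4}
print(dempster_combine(m_leaf, m_bark))      # most mass lands on {oak}
```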
Results & Impact
- Significant Accuracy Improvement: Increased recognition accuracy from 56% to 75% (19% absolute gain)
- Superior Performance: Outperformed state-of-the-art methods by over 20%
- Confusion Reduction: Belief functions theory effectively reduced confusion between similar species
- Mobile-Ready Solution: Developed application working offline on smartphones
- Scientific Contribution: Published in Expert Systems with Applications journal (9 citations)
- Practical Application: Enabled non-experts to identify tree species using their smartphones
Key Achievements
Accuracy Breakthrough
Improved recognition accuracy by 19% absolute (56% → 75%)
Innovative Methodology
Developed novel two-step approach using belief functions theory
Mobile Innovation
Created first offline tree recognition app for smartphones
Research Impact
Published in top-tier journal (Expert Systems with Applications)
Influence Maximization in Social Networks
PhD Dissertation - Evidence Theory Applications
Situation & Challenge
For viral marketing in social networks, companies needed to identify the most influential users, but existing solutions:
- Relied mainly on network structure, ignoring user opinions
- Were not robust to data uncertainty in social networks
- Could not target different marketing scenarios based on influencer and audience opinions
- Lacked theoretical foundations for handling uncertainty
Task & Objectives
Develop new influence maximization models that:
- Consider multiple influence aspects (network position, activity, opinion)
- Are robust to social network data uncertainty
- Enable targeting of different marketing scenarios
- Surpass existing model performance in influencer quality
Action & Research Approach
Implemented comprehensive research methodology:
- Innovative Modeling:
  - Developed two influence maximization models based on belief function theory
  - Created seven different influence measures for three marketing scenarios
  - Introduced evidential influence measure combining network position, message popularity, and user activity
- Rigorous Experimentation:
  - Collected and processed real Twitter dataset (36,274 users, 251,329 tweets)
  - Implemented complete opinion estimation pipeline using SentiWordNet and POS taggers
  - Conducted systematic comparisons with state-of-the-art models
  - Developed generated dataset for algorithm accuracy evaluation
- Technical Optimization:
  - Implemented CELF algorithm for efficient maximization (see the sketch after this list)
  - Ensured solution scalability for large networks
  - Validated theoretical properties (monotonicity, submodularity) of objective functions
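A minimal sketch of CELF's lazy-greedy selection, assuming `spread` is a monotone submodular influence estimate (e.g. a Monte-Carlo cascade simulation); both `spread` and the node set are placeholders.

```python
# Sketch: CELF lazy-greedy seed selection. `spread` stands in for a monotone
# submodular influence estimate (e.g. Monte-Carlo cascade simulation);
# nodes are assumed to be orderable ids so heap ties break cleanly.
import heapq

def celf(nodes, spread, k: int) -> set:
    # First pass: gain of each node alone (negated: heapq is a min-heap).
    heap = [(-spread({u}), u, 0) for u in nodes]
    heapq.heapify(heap)

    seeds: set = set()
    current = 0.0
    while len(seeds) < k and heap:
        neg_gain, u, stamp = heapq.heappop(heap)
        if stamp == len(seeds):
            # Gain already computed against the current seed set: take it.
            seeds.add(u)
            current = spread(seeds)
        else:
            # Stale entry: recompute lazily; submodularity guarantees the
            # true gain can only have shrunk, so re-push and keep popping.
            gain = spread(seeds | {u}) - current
            heapq.heappush(heap, (-gain, u, len(seeds)))
    return seeds
```

The lazy evaluation is what makes the greedy loop tractable at scale: most stale entries sink in the heap without ever being recomputed.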
Results & Contributions
- Technical Performance: 80%+ precision improvement on generated data
- Computational Efficiency: 32-536 ms vs several minutes/hours for classical approaches
- Quality Improvement: Detected influencers with 85% positive opinion vs 41% for existing models
- Theoretical Contribution: Novel application of evidence theory to social network analysis
- Research Impact: Multiple publications in top-tier journals and conferences
Key Achievements
Theoretical Innovation
First application of belief function theory to influence maximization
Performance Breakthrough
Achieved 80%+ precision improvement over state-of-the-art
Real-world Dataset
Built and analyzed comprehensive Twitter dataset
Research Recognition
Published in Knowledge-Based Systems (66+ citations)