Privacy-Preserving RAG System for Financial Contracts
On-premise AI Solution with Advanced Document Intelligence
Situation & Privacy Challenge
Financial institutions require AI solutions that balance performance with data privacy:
- Data Sensitivity: Financial contracts contain highly confidential information
- Performance Needs: 4+ hour manual review per complex contract
- Privacy Constraints: Prohibition of external API calls for sensitive data
- Accuracy Requirements: Need for >90% accuracy in legal document analysis
Task & Technical Objectives
Develop a fully on-premise RAG system meeting strict privacy requirements:
- Privacy-First Architecture: 100% on-premise deployment, no external APIs
- Open-Source LLM Integration: Implement Llama 3 for local inference
- Advanced Chunking: Semantic chunking preserving document structure
- Optimized Retrieval: Implement re-ranking for precision improvement
- Maximum Efficiency: Target 70%+ reduction in processing time
Action & Technical Innovations
🔒 Privacy-Preserving Architecture
- Full On-Premise Deployment: Complete control over data flow
- Llama 3 Integration: 8B parameter model running locally
- Air-Gapped Capability: Operates without internet connectivity
- Encrypted Storage: Vector database with at-rest encryption
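To make the encrypted-storage point concrete, here is a minimal sketch assuming FAISS as the local vector index and a Fernet symmetric key standing in for one fetched from the institution's key-management service; the dimension, file name, and placeholder vectors are illustrative.

```python
# Sketch: encrypting a FAISS vector index at rest with a symmetric key.
# Assumptions: faiss and cryptography are installed; in production the key
# would come from a KMS, not be generated inline as here.
import faiss
import numpy as np
from cryptography.fernet import Fernet

dim = 384                                     # illustrative embedding size
index = faiss.IndexFlatIP(dim)                # simple inner-product index
index.add(np.random.rand(100, dim).astype("float32"))  # placeholder vectors

key = Fernet.generate_key()                   # stand-in for a KMS-managed key
fernet = Fernet(key)

# Serialize the index to bytes, encrypt, and persist.
blob = faiss.serialize_index(index)           # numpy uint8 array
with open("contracts.index.enc", "wb") as f:
    f.write(fernet.encrypt(bytes(blob)))

# Decrypt and restore before serving queries.
with open("contracts.index.enc", "rb") as f:
    raw = fernet.decrypt(f.read())
index = faiss.deserialize_index(np.frombuffer(raw, dtype="uint8"))
```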
📄 Advanced Document Processing
- Semantic Chunking: Intelligent segmentation (a chunking sketch follows this list) preserving:
  - Legal document structure (sections, subsections)
  - Financial table integrity
  - Cross-references
  - Hierarchical relationships
- Context-Aware Processing:
  - Document type detection (NDA, MSA, SOW, etc.)
  - Party identification and role extraction
  - Jurisdiction and governing law detection
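A minimal sketch of the structure-aware chunking idea, assuming contracts arrive as plain text with numbered section headings; the heading regex, size cap, and paragraph fallback are illustrative stand-ins for the production algorithm.

```python
# Sketch: split contract text at section headings so no chunk straddles a
# legal section boundary. Heading pattern and size cap are illustrative.
import re

HEADING = re.compile(r"^(?:Section\s+\d+|\d+(?:\.\d+)*[.)]?)\s", re.MULTILINE)

def semantic_chunks(text: str, max_chars: int = 2000) -> list[str]:
    starts = [m.start() for m in HEADING.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)                   # keep any preamble text
    sections = [text[a:b].strip()
                for a, b in zip(starts, starts[1:] + [len(text)])]
    chunks: list[str] = []
    for sec in sections:
        if len(sec) <= max_chars:
            chunks.append(sec)                # whole section fits in one chunk
        else:
            # Oversized section: fall back to paragraph-level splits.
            chunks.extend(p.strip() for p in sec.split("\n\n") if p.strip())
    return [c for c in chunks if c]
```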
⚡ Optimized Retrieval Pipeline
- Multi-Stage Retrieval (see the retrieval sketch after this list):
  - Initial Retrieval: Dense embeddings with sentence transformers
  - Re-ranking: Cross-encoder models for precision improvement
  - Context Expansion: Adjacent chunk inclusion for coherence
- Performance Optimizations:
  - Batch processing for parallel inference
  - Model quantization for reduced memory footprint
  - Cache layer for frequent queries
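A minimal sketch of the two-stage retrieve-then-re-rank flow using the open-source sentence-transformers library; the specific model names and top-k values are illustrative defaults, not the deployed configuration.

```python
# Sketch: two-stage retrieval with sentence-transformers models that run
# fully locally. Model names and top-k values are illustrative defaults.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve(query: str, chunks: list[str], k_dense: int = 20, k_final: int = 5):
    # Stage 1: dense retrieval (corpus embeddings would be precomputed in practice).
    corpus_emb = bi_encoder.encode(chunks, convert_to_tensor=True)
    query_emb = bi_encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, corpus_emb, top_k=k_dense)[0]

    # Stage 2: cross-encoder re-ranking of the dense candidates for precision.
    candidates = [chunks[h["corpus_id"]] for h in hits]
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:k_final]]
```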
Key Technical Innovations:
- Developed document-structure-aware chunking algorithm
- Built multi-modal re-ranker combining semantic and lexical signals
- Designed privacy audit trail for compliance reporting
Results & Impact
- Unprecedented Efficiency: Contract review time reduced from 4 hours to 72 minutes (70% reduction)
- Enterprise-Grade Privacy: Full compliance with financial regulations and data sovereignty requirements
- Superior Accuracy: 90%+ accuracy achieved through re-ranking and domain adaptation
- Cost Optimization: Eliminated external API costs while maintaining performance
📈 Performance Benchmarks vs Alternatives
| Metric | Our Solution | GPT-4 API | Manual Review |
|---|---|---|---|
| Accuracy | 92% | 94% | 85% |
| Data Privacy | 100% | 0% | 100% |
| Cost per Document | $0 | $2.50 | $200 |
Key Achievements
Privacy Innovation
Delivered enterprise-grade privacy with 100% on-premise deployment
Performance Excellence
Achieved 70% time reduction while maintaining 90%+ accuracy
Technical Leadership
Successfully implemented Llama 3 with advanced RAG optimizations
Business Impact
Enabled secure AI adoption in highly regulated financial environment
System Architecture Overview
Document Ingestion (PDF/OCR, parsing) → Semantic Chunking (structure-aware) → Vector Storage (local embeddings) → Retrieval & Re-rank (multi-stage) → Llama 3 Generation (on-premise)
Intelligent Email Classification System
DBS Banking - Multi-label classification for customer service optimization
Situation & Challenge
DBS banking teams received hundreds of emails daily, with urgency varying by email type/label. The manual classification process was:
- Time-consuming and inconsistent across different labelers
- Prone to human error in identifying urgent vs non-urgent emails
- Unable to handle the volume efficiently (multiple labels per email)
Task & Responsibilities
My primary responsibilities included:
- Design and implementation of NLP preprocessing pipelines
- Fine-tuning of deep learning models for multi-label classification
- Comprehensive evaluation and analysis of model performance
- Data quality investigation and process improvement recommendations
Action & Approach
Initial results showed poor model performance (~70% F1). Through deep data analysis, I discovered a critical issue:
- Inconsistency Discovery: Comparing emails with identical labels revealed significant variations when labeled by different people
- Process Intervention: Requested unified labeling standards and attended labeling sessions to understand data better
- Improved Preprocessing: Enhanced data cleaning and feature engineering based on labeling insights
- Model Optimization: Implemented advanced NLP techniques and fine-tuned models with improved data
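A minimal sketch of the multi-label setup behind the model optimization step, assuming a Hugging Face encoder fine-tuned with one sigmoid output per label; the base model, label names, and 0.5 threshold are hypothetical placeholders (the real labels are internal).

```python
# Sketch: multi-label head on a pretrained encoder; one sigmoid per label.
# Base model, label names, and threshold are hypothetical placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["urgent", "complaint", "account_query", "fraud_alert"]  # hypothetical

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=len(LABELS),
    problem_type="multi_label_classification",  # BCE loss during fine-tuning
)

def predict_labels(email_text: str, threshold: float = 0.5) -> list[str]:
    inputs = tokenizer(email_text, truncation=True, return_tensors="pt")
    with torch.no_grad():
        probs = torch.sigmoid(model(**inputs).logits)[0]  # independent per label
    return [lab for lab, p in zip(LABELS, probs) if p >= threshold]
```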
Results & Impact
- Performance Leap: Improved F1 score from 70% to over 90%
- Process Standardization: Established consistent labeling protocols across teams
- Efficiency Gain: Automated classification enabled faster processing and prioritization of urgent emails
- Scalable Solution: System capable of handling growing email volumes
Key Achievements
Data Quality Insight
Identified critical labeling inconsistencies that were degrading model performance
Process Improvement
Implemented standardized labeling protocols that improved data quality
Performance Optimization
Achieved a >20-point F1 improvement in classification performance
Business Impact
Enabled faster email processing and better customer service response times
Automated Financial Table Extraction System
Financial Spreading Automation - PDF Report Processing
Situation & Challenge
Manual extraction of financial tables (balance sheets, income statements, cash flow) from annual PDF reports was a major bottleneck in financial spreading processes:
- Time-consuming manual process prone to human errors
- Existing table extraction solutions not adapted to financial document complexity
- No standardized templates across different companies
- Multi-page tables with complex structures
Task & Responsibilities
My mission was to design, implement, and evaluate a robust automated financial table extraction pipeline:
- Analyze and evaluate state-of-the-art table extraction technologies
- Define the overall solution architecture and pipeline
- Create and annotate a reference dataset for training and evaluation
- Develop post-processing algorithms for table correction and restructuring
- Conduct comprehensive system performance evaluation
Action & Approach
I developed an innovative hybrid architecture combining multiple approaches:
- State-of-the-art Analysis: Evaluated existing tools (Camelot, Tabula) and deep learning models (CascadeTabNet, TableNet)
- Hybrid Pipeline Design:
  - Discovery step using regex to identify pages containing target financial tables (see the sketch after this list)
  - YOLOv3 model fine-tuned on FinTabNet for precise table region detection
  - Custom post-processing module for cleaning extraction artifacts and reconstructing logical table structure
- Dataset Creation: Supervised the creation of a manually annotated dataset (291 tables from 100 annual reports)
- Custom Metrics: Developed the ExactMatchSim metric and a multi-criteria evaluation methodology
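A minimal sketch of the regex discovery step, assuming per-page text has already been extracted from the PDF; the keyword patterns are illustrative and far simpler than the production rules.

```python
# Sketch: regex discovery of pages likely to contain target financial tables.
# Keyword patterns are illustrative; the production rules were more elaborate.
import re

TABLE_PATTERNS = {
    "balance_sheet": re.compile(r"balance\s+sheet", re.IGNORECASE),
    "income_statement": re.compile(r"(income|profit\s+and\s+loss)\s+statement",
                                   re.IGNORECASE),
    "cash_flow": re.compile(r"cash\s+flow\s+statement", re.IGNORECASE),
}

def discover_pages(pages: list[str]) -> dict[str, list[int]]:
    """Map each target table type to the page numbers that mention it."""
    found: dict[str, list[int]] = {name: [] for name in TABLE_PATTERNS}
    for page_no, text in enumerate(pages, start=1):
        for name, pattern in TABLE_PATTERNS.items():
            if pattern.search(text):
                found[name].append(page_no)
    return found
```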
Results & Impact
- High Detection Rates: >95% detection for single-page tables, >96% overall document detection
- Excellent Accuracy: 94% column similarity, 92.1% row similarity, 95.7% number extraction accuracy
- Research Contribution: Created reference dataset and published results in scientific paper
- Business Value: Dramatically reduced manual extraction time and errors
Key Achievements
Innovative Architecture
Designed hybrid pipeline combining regex discovery, deep learning detection, and custom post-processing
Dataset Creation
Built comprehensive annotated dataset for financial table extraction
High Accuracy
Achieved >95% accuracy on critical financial data extraction
Research Publication
Published results validating innovative approach in scientific paper
Murex Trading System Reconciliation Clustering
Root Cause Analysis Automation
Situation & Challenge
Trading system reconciliation projects typically involved 20+ business analysts over 2 years. The challenge was to cluster mismatches with the same root cause:
- The existing solution's computational complexity prevented timely results
- Large dataset: 130,000+ transactions (65,000 mismatches, 48 features)
- Highly imbalanced data with repetitive root causes and duplicates
- Need for scalable solution to handle growing data volumes
Task & Responsibilities
My main task was to reduce solution complexity and achieve results in reasonable time:
- Data preprocessing and quality improvement
- Feature reduction and dataset optimization
- Clustering algorithm implementation and evaluation
- Performance comparison and impact analysis
- Development of reproducible methodology
Action & Approach
I implemented a comprehensive data reduction and clustering strategy:
- Data Preprocessing:
  - Defined unique transaction keys by concatenating Murex number and operation type
  - Removed rows with missing values, inconsistent duplicates, and transactions without root cause
- Data Reduction:
  - Duplicate removal (reduction from 65,273 to 19,095 observations)
  - Custom sampling with max_count threshold per cluster
  - Elimination of low-variance or single-value features
- Clustering Implementation (a DBSCAN sketch follows this list):
  - Implemented FuzzyART and DBSCAN with parameter tuning
  - Used metrics: Adjusted Rand Score, purity, detectability
  - Validated by projecting clusters onto the complete dataset
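A minimal sketch of the reduction-then-clustering flow for the DBSCAN branch, assuming the mismatches sit in a pandas DataFrame; the column names (`murex_no`, `op_type`, `root_cause`) and the DBSCAN parameters are illustrative.

```python
# Sketch: dedup + low-variance pruning, then DBSCAN scored with the
# Adjusted Rand Score against known root causes. Column names and the
# eps/min_samples values are illustrative; features assumed numeric.
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

def cluster_mismatches(df: pd.DataFrame, feature_cols: list[str]):
    # Unique transaction key: Murex number concatenated with operation type.
    df = df.assign(tx_key=df["murex_no"].astype(str) + "_" + df["op_type"])
    df = df.dropna(subset=feature_cols + ["root_cause"])
    df = df.drop_duplicates(subset=feature_cols)

    # Drop single-value / near-constant features.
    X = df[feature_cols]
    X = X.loc[:, X.nunique() > 1]

    labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(
        StandardScaler().fit_transform(X)
    )
    return labels, adjusted_rand_score(df["root_cause"], labels)
```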
Results & Impact
- Significant Data Reduction: Up to 96% feature reduction while preserving 97% of root causes
- Improved Performance: FuzzyART achieved 0.65-0.67 Adjusted Rand Score
- Method Validation: Demonstrated that removing duplicates and uninformative features doesn't harm clustering quality
- Reproducible Methodology: Validated approach on simulated data with various noise levels
- Scalable Solution: Enabled efficient and scalable root cause detection in industrial context
Key Achievements
Data Optimization
Reduced dataset by 96% while maintaining 97% of critical information
Algorithm Performance
Achieved high-quality clustering with 0.67 Adjusted Rand Score
Computational Efficiency
Enabled timely results through intelligent data reduction
Industrial Scalability
Developed methodology applicable to large-scale industrial data
Multimodal Tree Species Recognition System
LISTIC Lab, Annecy - Mobile AI Application for Botany
Situation & Challenge
Tree species recognition presents significant challenges due to:
- High diversity of tree species in nature
- Interspecies similarity and intra-species variability
- Frequent recognition confusions caused by similarities between species
- Need for offline mobile applications accessible to everyone
- Existing solutions achieving only 56% accuracy
Task & Objectives
The project had three main objectives:
- Intelligent Decision System: Develop a system emulating botanist expertise using belief functions theory to reduce confusion and improve accuracy
- Mobile Solution: Create a practical smartphone application working offline, adapted to memory and computation limits
- Accessibility: Make tree species recognition accessible and easy to use for everyone in nature
Action & Technical Approach
I developed an innovative two-step multimodal recognition approach:
- Step 1: Leaf Identification
  - Used to reduce problem dimensionality
  - Identifies subset of most probable species
  - Leverages leaf morphology and texture features
- Step 2: Bark Refinement (a Dempster-combination sketch follows this list)
  - Modified evidential k-Nearest Neighbors (EkNN) algorithm
  - Recognizes bark from first step output
  - Belief functions theory for reasoning with uncertainty
- Mobile Optimization
  - Designed for offline smartphone use
  - Optimized for memory and computation constraints
  - Lightweight model architecture
- Experimental Validation
  - Conducted experiments on real-world data
  - Compared against existing solutions
  - Validated accuracy improvements
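To make the belief-function step concrete, here is a minimal sketch of Dempster's rule of combination, the operation used to fuse evidence from the two modalities; the toy three-species frame and the two mass functions are illustrative, not actual EkNN outputs.

```python
# Sketch: Dempster's rule of combination over a toy frame of three species.
# m_leaf and m_bark are illustrative mass functions, not real EkNN output.
from itertools import product

def dempster_combine(m1: dict[frozenset, float], m2: dict[frozenset, float]):
    combined: dict[frozenset, float] = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb              # mass on contradictory pairs
    return {s: w / (1.0 - conflict) for s, w in combined.items()}

oak, beech, ash = "oak", "beech", "ash"
m_leaf = {frozenset({oak, beech}): 0.7,      # leaf narrows to two species
          frozenset({oak, beech, ash}): 0.3} # remaining mass on ignorance
m_bark = {frozenset({oak}): 0.6,
          frozenset({oak, beech, ash}): 0.4}
print(dempster_combine(m_leaf, m_bark))      # most mass lands on {oak}
```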
Results & Impact
- Significant Accuracy Improvement: Increased recognition accuracy from 56% to 75% (19% absolute gain)
- Superior Performance: Outperformed state-of-the-art methods by over 20%
- Confusion Reduction: Belief functions theory effectively reduced confusion between similar species
- Mobile-Ready Solution: Developed application working offline on smartphones
- Scientific Contribution: Published in Expert Systems with Applications journal (9 citations)
- Practical Application: Enabled non-experts to identify tree species using their smartphones
Key Achievements
Accuracy Breakthrough
Improved recognition accuracy by 19% absolute (56% → 75%)
Innovative Methodology
Developed novel two-step approach using belief functions theory
Mobile Innovation
Created first offline tree recognition app for smartphones
Research Impact
Published in top-tier journal (Expert Systems with Applications)
Influence Maximization in Social Networks
PhD Dissertation - Evidence Theory Applications
Situation & Challenge
For viral marketing in social networks, companies needed to identify the most influential users, but existing solutions:
- Relied mainly on network structure, ignoring user opinions
- Were not robust to data uncertainty in social networks
- Could not target different marketing scenarios based on influencer and audience opinions
- Lacked theoretical foundations for handling uncertainty
Task & Objectives
Develop new influence maximization models that:
- Consider multiple influence aspects (network position, activity, opinion)
- Are robust to social network data uncertainty
- Enable targeting of different marketing scenarios
- Surpass existing model performance in influencer quality
Action & Research Approach
Implemented comprehensive research methodology:
- Innovative Modeling:
  - Developed two influence maximization models based on belief function theory
  - Created seven different influence measures for three marketing scenarios
  - Introduced evidential influence measure combining network position, message popularity, and user activity
- Rigorous Experimentation:
  - Collected and processed real Twitter dataset (36,274 users, 251,329 tweets)
  - Implemented complete opinion estimation pipeline using SentiWordNet and POS taggers
  - Conducted systematic comparisons with state-of-the-art models
  - Developed generated dataset for algorithm accuracy evaluation
- Technical Optimization:
  - Implemented CELF algorithm for efficient maximization (see the sketch after this list)
  - Ensured solution scalability for large networks
  - Validated theoretical properties (monotonicity, submodularity) of objective functions
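A minimal sketch of CELF's lazy-greedy selection, assuming `spread` is a monotone submodular influence estimate (e.g. a Monte-Carlo cascade simulation); both `spread` and the node set are placeholders.

```python
# Sketch: CELF lazy-greedy seed selection. `spread` stands in for a monotone
# submodular influence estimate (e.g. Monte-Carlo cascade simulation);
# nodes are assumed to be orderable ids so heap ties break cleanly.
import heapq

def celf(nodes, spread, k: int) -> set:
    # First pass: gain of each node alone (negated: heapq is a min-heap).
    heap = [(-spread({u}), u, 0) for u in nodes]
    heapq.heapify(heap)

    seeds: set = set()
    current = 0.0
    while len(seeds) < k and heap:
        neg_gain, u, stamp = heapq.heappop(heap)
        if stamp == len(seeds):
            # Gain already computed against the current seed set: take it.
            seeds.add(u)
            current = spread(seeds)
        else:
            # Stale entry: recompute lazily; submodularity guarantees the
            # true gain can only have shrunk, so re-push and keep popping.
            gain = spread(seeds | {u}) - current
            heapq.heappush(heap, (-gain, u, len(seeds)))
    return seeds
```

The lazy evaluation is what makes the greedy loop tractable at scale: most stale entries sink in the heap without ever being recomputed.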
Results & Contributions
- Technical Performance: 80%+ precision improvement on generated data
- Computational Efficiency: 32-536 ms vs several minutes/hours for classical approaches
- Quality Improvement: Detected influencers with 85% positive opinion vs 41% for existing models
- Theoretical Contribution: Novel application of evidence theory to social network analysis
- Research Impact: Multiple publications in top-tier journals and conferences
Key Achievements
Theoretical Innovation
First application of belief function theory to influence maximization
Performance Breakthrough
Achieved 80%+ precision improvement over state-of-the-art
Real-world Dataset
Built and analyzed comprehensive Twitter dataset
Research Recognition
Published in Knowledge-Based Systems (66+ citations)