Financial Data Extraction Pipeline
Extract financial data from PDFs in Google Drive using Gemini AI
Workflow Information
ID: financial_data_extraction_workflow_v2
Namespace: default
Version: 1.0
Created: 2025-08-01
Updated: 2025-08-01
Tasks: 3
Inputs
| Name | Type | Required | Default |
|---|---|---|---|
| folder_id | string | Required | 1W22-59ESyR-E_1PMVWevzL-WvlFALDl- |
| gemini_api_key | string | Required | AIzaSyB0_e6aU4gF-qRapMm3UYBSITpbd0ehsYk |
| nango_connection_id | string | Required | 4274993f-c614-4efa-a01e-8d07422f4b09 |
| nango_key | string | Required | 8df3e2de-2307-48d3-94bd-ddd3fd6a62ec |
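
The nango_connection_id and nango_key inputs are used by each task script to mint a short-lived Google Drive access token before any Drive API call. A minimal sketch of that exchange, condensed from the scripts in the YAML source below (the auth host and provider_config_key value are taken verbatim from those scripts):

```python
import requests

def get_drive_access_token(nango_connection_id: str, nango_key: str) -> str:
    """Exchange a Nango connection for a Google Drive OAuth access token."""
    auth_url = (
        f"https://auth-dev.assistents.ai/connection/{nango_connection_id}"
        "?provider_config_key=google-drive-hq3h"
    )
    headers = {
        "Authorization": f"Bearer {nango_key}",
        "Content-Type": "application/json",
    }
    response = requests.get(auth_url, headers=headers, timeout=10)
    response.raise_for_status()
    # The task scripts read the token from credentials.access_token in the response body.
    return response.json()["credentials"]["access_token"]
```

folder_id is then used as the parent in Drive file queries, and gemini_api_key is handed to the Gemini client in the extraction task.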
Outputs
| Name | Type | Source | Description |
|---|---|---|---|
| company_results | object | final_report.company_results | Detailed financial data for all processed companies |
| extraction_summary | object | final_report.extraction_summary | Technical details about the extraction process |
| processing_summary | object | final_report.processing_summary | Summary of companies processed and extraction success rates |
| workflow_execution | object | final_report.workflow_execution | Workflow execution metadata and status |
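
Each workflow output is wired to a key of the final_report task's outputs (for example, final_report.company_results). The task scripts publish those outputs by printing a single marker line to stdout; the sketch below shows the convention used throughout the YAML source. The engine-side parsing is an assumption, inferred from the __OUTPUTS__ marker the scripts emit:

```python
import json

# Inside a task script: publish outputs as a single JSON line behind a marker.
outputs = {"status": "success", "total_companies": 12}
print(f"__OUTPUTS__ {json.dumps(outputs)}")

# Presumed engine side (assumption): scan the task's stdout for the marker and parse the payload.
def parse_task_outputs(stdout: str) -> dict:
    marker = "__OUTPUTS__ "
    for line in stdout.splitlines():
        if line.startswith(marker):
            return json.loads(line[len(marker):])
    return {}
```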
Tasks
list_company_folders (script): Get all company folders from the main directory
process_companies_loop (loop): Process multiple companies using a loop with an iteration limit
Loop Configuration
Type: for
Max Iterations: 1
Iterator Variable: company_index
State Variables: total_extractions, processed_companies, successful_extractions
Loop Flow (2 steps)
1. Process Current Company (script)
2. Process FY Folders for Current Company (script; the Gemini extraction step is sketched after this list)
final_report (script): Generate a comprehensive report from all loop processing results
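
The core of the pipeline is the per-FY extraction inside Process FY Folders for Current Company: each downloaded PDF is uploaded to Gemini, and structured JSON is requested for the balance sheet and the P&L. A condensed sketch of that pattern, using the same google-genai calls that appear in the YAML source (prompt abbreviated, display name illustrative, and a bounded wait in place of the script's fixed sleeps):

```python
import io
import json
import time
from google import genai

def extract_balance_sheet(api_key: str, pdf_bytes: bytes, company: str, fy: str) -> dict:
    client = genai.Client(api_key=api_key)

    # Upload the PDF with an explicit MIME type, then wait until Gemini marks it ACTIVE.
    uploaded = client.files.upload(
        file=io.BytesIO(pdf_bytes),
        config={"display_name": f"{company}_{fy}.pdf", "mime_type": "application/pdf"},
    )
    for _ in range(10):
        if client.files.get(name=uploaded.name).state.name == "ACTIVE":
            break
        time.sleep(2)

    # Ask for structured JSON; the real prompt also enumerates the expected keys.
    prompt = (
        f"Extract Balance Sheet data for {company} - {fy}. Return JSON with assets, "
        "liabilities and equity. All figures in Rs. Million. Use null for missing values."
    )
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=[prompt, uploaded],
        config={"response_mime_type": "application/json"},
    )
    return json.loads(response.text)
```

The full script additionally filters folder contents down to PDFs, skips Google Apps files it cannot export, and aggregates the per-FY results into an Excel workbook with openpyxl.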
YAML Source

```yaml
id: financial_data_extraction_workflow_v1
name: Financial Data Extraction Pipeline
retry:
retryOn:
- TEMPORARY_FAILURE
- TIMEOUT
- NETWORK_ERROR
maxDelay: 60s
maxAttempts: 2
initialDelay: 5s
backoffMultiplier: 2.0
tasks:
- id: list_company_folders
name: List Company Folders
type: script
script: "import requests\nimport json\n\nprint(\"\U0001F4C1 LISTING COMPANY FOLDERS\"\
)\nprint(\"=\" * 50)\nprint(\"\")\n\ntry:\n # Get credentials directly like\
\ working workflow\n nango_connection_id = \"${nango_connection_id}\"\n \
\ nango_key = \"${nango_key}\"\n folder_id = \"${folder_id}\"\n \n print(f\"\
\U0001F4CB Target Folder ID: {folder_id}\")\n print(\"\")\n \n # Get\
\ access token\n auth_url = f\"https://auth-dev.assistents.ai/connection/{nango_connection_id}?provider_config_key=google-drive-hq3h\"\
\n headers = {\n 'Authorization': f'Bearer {nango_key}',\n 'Content-Type':\
\ 'application/json'\n }\n \n print(\"\U0001F510 Getting access token...\"\
)\n auth_response = requests.get(auth_url, headers=headers, timeout=10)\n \
\ auth_response.raise_for_status()\n access_token = auth_response.json()['credentials']['access_token']\n\
\ print(\"\u2705 Access token obtained\")\n \n # List folders in main\
\ directory\n drive_headers = {\n \"Authorization\": f\"Bearer {access_token}\"\
,\n \"Content-Type\": \"application/json\",\n }\n \n query = f\"\
'{folder_id}' in parents and mimeType='application/vnd.google-apps.folder' and\
\ trashed=false\"\n params = {\"q\": query, \"fields\": \"files(id, name)\"\
, \"pageSize\": 100}\n \n print(\"\U0001F4C2 Fetching company folders...\"\
)\n response = requests.get(\"https://www.googleapis.com/drive/v3/files\",\
\ headers=drive_headers, params=params)\n response.raise_for_status()\n \
\ \n folders = response.json().get(\"files\", [])\n \n # Limit to first\
\ 3 companies to avoid timeout\n limited_folders = folders[:3]\n \n print(f\"\
\u2705 Found {len(folders)} total companies\")\n print(f\"\U0001F4CA Processing\
\ {len(limited_folders)} companies (limited for demo)\")\n \n for i, folder\
\ in enumerate(limited_folders, 1):\n print(f\" {i}. {folder['name']}\
\ (ID: {folder['id']})\")\n \n outputs = {\n \"folders\": limited_folders,\n\
\ \"total_companies\": len(folders),\n \"processing_companies\"\
: len(limited_folders),\n \"access_token\": access_token, # Pass token\
\ to loop tasks\n \"status\": \"success\"\n }\n \n print(\"\"\
)\n print(f\"__OUTPUTS__ {json.dumps(outputs)}\")\n \nexcept Exception as\
\ e:\n print(f\"\u274C Error listing company folders: {str(e)}\")\n outputs\
\ = {\"status\": \"failed\", \"error\": str(e)}\n print(f\"__OUTPUTS__ {json.dumps(outputs)}\"\
)\n"
description: Get all company folders from the main directory
timeout_seconds: 120
- id: process_companies_loop
name: Process Companies with Loop
type: loop
loop_type: for
depends_on:
- list_company_folders
loop_tasks:
- id: process_current_company
name: Process Current Company
type: script
script: "import requests\nimport json\nimport logging\n\nlogger = logging.getLogger(__name__)\n\
\ntry:\n folder_data = ${list_company_folders}\n company_idx = ${company_index}\n\
\ \n print(f\"\U0001F50D DEBUG: folder data {folder_data}\")\n print(f\"\
\U0001F50D DEBUG: Loop iteration {company_idx}\")\n print(f\"\U0001F4C1 Folder\
\ data status: {folder_data.get('status', 'unknown')}\")\n print(f\"\U0001F4CA\
\ Number of folders found: {len(folder_data.get('folders', []))}\")\n print(f\"\
\U0001F4CB Company index: {company_idx}\")\n \n # Check if we have the\
\ required data (folders and access_token) instead of just status\n if not\
\ folder_data.get(\"folders\") or not folder_data.get(\"access_token\"):\n \
\ outputs = {\"status\": \"skipped\", \"reason\": f\"Missing required\
\ data: folders={bool(folder_data.get('folders'))}, access_token={bool(folder_data.get('access_token'))}\"\
}\n print(f\"__OUTPUTS__ {json.dumps(outputs)}\")\n exit()\n \
\ \n if company_idx >= len(folder_data[\"folders\"]):\n outputs\
\ = {\"status\": \"skipped\", \"reason\": f\"Company index {company_idx} >=\
\ {len(folder_data['folders'])} companies\"}\n print(f\"__OUTPUTS__ {json.dumps(outputs)}\"\
)\n exit()\n \n company = folder_data[\"folders\"][company_idx]\n\
\ company_name = company[\"name\"]\n company_id = company[\"id\"]\n \
\ \n # Get fresh access token (tokens might expire during processing)\n\
\ print(f\"\U0001F510 DEBUG: Getting fresh access token for file processing...\"\
)\n nango_connection_id = \"${nango_connection_id}\"\n nango_key = \"\
${nango_key}\"\n \n auth_url = f\"https://auth-dev.assistents.ai/connection/{nango_connection_id}?provider_config_key=google-drive-hq3h\"\
\n auth_headers = {\n 'Authorization': f'Bearer {nango_key}',\n \
\ 'Content-Type': 'application/json'\n }\n \n auth_response =\
\ requests.get(auth_url, headers=auth_headers, timeout=10)\n auth_response.raise_for_status()\n\
\ access_token = auth_response.json()['credentials']['access_token']\n \
\ print(f\"\u2705 DEBUG: Fresh access token obtained\")\n \n base_url\
\ = \"https://www.googleapis.com/drive/v3\"\n headers = {\n \"Authorization\"\
: f\"Bearer {access_token}\",\n \"Content-Type\": \"application/json\"\
,\n }\n \n # List FY folders in company\n query = f\"'{company_id}'\
\ in parents and mimeType='application/vnd.google-apps.folder' and trashed=false\"\
\n params = {\"q\": query, \"fields\": \"files(id, name)\", \"pageSize\"\
: 10}\n \n response = requests.get(f\"{base_url}/files\", headers=headers,\
\ params=params)\n response.raise_for_status()\n \n fy_folders = response.json().get(\"\
files\", [])\n \n outputs = {\n \"company_name\": company_name,\n\
\ \"company_id\": company_id,\n \"fy_folders\": fy_folders[:2],\
\ # Limit to 2 FY per company\n \"total_fy_folders\": len(fy_folders),\n\
\ \"company_index\": company_idx,\n \"status\": \"success\"\n\
\ }\n \n logger.info(f\"[{company_idx}] Company {company_name}: Found\
\ {len(fy_folders)} FY folders\")\n print(f\"__OUTPUTS__ {json.dumps(outputs)}\"\
)\n \nexcept Exception as e:\n logger.error(f\"Error processing company\
\ {company_idx}: {str(e)}\")\n outputs = {\"status\": \"failed\", \"error\"\
: str(e), \"company_index\": company_idx}\n print(f\"__OUTPUTS__ {json.dumps(outputs)}\"\
)\n"
depends_on:
- list_company_folders
timeout_seconds: 180
- id: process_fy_folders_current_company
name: Process FY Folders for Current Company
type: script
script: "import requests\nimport json\nimport logging\nimport io\nimport time\n\
from datetime import datetime\nfrom google import genai\n\nlogger = logging.getLogger(__name__)\n\
\ntry:\n # Debug: Print all available inputs\n print(f\"\U0001F50D DEBUG:\
\ All available inputs: {list(inputs.keys())}\")\n print(f\"\U0001F50D DEBUG:\
\ Loop state keys: {list(loop_state.keys())}\")\n \n # Access company\
\ data fields directly from inputs (loop executor flattens task outputs)\n \
\ company_name = inputs.get('company_name', '')\n company_id = inputs.get('company_id',\
\ '')\n fy_folders = inputs.get('fy_folders', [])\n task_status = inputs.get('status',\
\ '')\n \n # Access folder data from workflow inputs (available in loop\
\ context)\n # In loop context, list_company_folders comes as string, need\
\ to parse it\n folder_data_raw = inputs.get('list_company_folders', {})\n\
\ if isinstance(folder_data_raw, str):\n folder_data = json.loads(folder_data_raw)\n\
\ else:\n folder_data = folder_data_raw\n \n print(f\"\U0001F50D\
\ DEBUG: Company name: {company_name}\")\n print(f\"\U0001F50D DEBUG: Company\
\ ID: {company_id}\")\n print(f\"\U0001F50D DEBUG: FY folders: {fy_folders}\"\
)\n print(f\"\U0001F50D DEBUG: Task status: {task_status}\")\n \n if\
\ task_status != \"success\" or not fy_folders:\n print(f\"\U0001F50D\
\ DEBUG: Skipping company - task_status: {task_status}, fy_folders: {len(fy_folders)\
\ if fy_folders else 0}\")\n outputs = {\"status\": \"skipped\", \"reason\"\
: \"No FY folders to process for this company\"}\n print(f\"__OUTPUTS__\
\ {json.dumps(outputs)}\")\n exit()\n \n # company_name, company_id,\
\ fy_folders already extracted above\n fy_folders = fy_folders[:2] # Process\
\ max 2 FY folders\n \n # Initialize Gemini client\n gemini_api_key\
\ = \"${gemini_api_key}\"\n client = genai.Client(api_key=gemini_api_key)\n\
\ model_id = \"gemini-2.5-flash\"\n \n # Get fresh access token for\
\ file downloads (prevent expiration issues)\n print(f\"\U0001F510 DEBUG:\
\ Getting fresh access token for file downloads...\")\n nango_connection_id\
\ = \"${nango_connection_id}\"\n nango_key = \"${nango_key}\"\n \n \
\ auth_url = f\"https://auth-dev.assistents.ai/connection/{nango_connection_id}?provider_config_key=google-drive-hq3h\"\
\n auth_headers = {\n 'Authorization': f'Bearer {nango_key}',\n \
\ 'Content-Type': 'application/json'\n }\n \n auth_response =\
\ requests.get(auth_url, headers=auth_headers, timeout=10)\n auth_response.raise_for_status()\n\
\ access_token = auth_response.json()['credentials']['access_token']\n \
\ print(f\"\u2705 DEBUG: Fresh access token obtained for downloads\")\n \
\ \n base_url = \"https://www.googleapis.com/drive/v3\"\n headers = {\n\
\ \"Authorization\": f\"Bearer {access_token}\",\n \"Content-Type\"\
: \"application/json\",\n }\n \n company_extractions = []\n \n \
\ print(f\"\U0001F50D DEBUG: Starting FY processing for {company_name}\")\n\
\ print(f\"\U0001F50D DEBUG: FY folders to process: {len(fy_folders)}\")\n\
\ \n for fy_idx, fy_folder in enumerate(fy_folders):\n try:\n \
\ fy_name = fy_folder[\"name\"]\n fy_id = fy_folder[\"\
id\"]\n \n print(f\"\U0001F50D DEBUG: Processing FY folder\
\ {fy_idx + 1}/{len(fy_folders)}: {fy_name}\")\n logger.info(f\"\
Processing {company_name} - {fy_name}\")\n \n # Get all\
\ items in FY folder (including sub-folders)\n query = f\"'{fy_id}'\
\ in parents and trashed=false\"\n params = {\n \"\
q\": query, \n \"fields\": \"files(id, name, mimeType, size)\"\
, \n \"pageSize\": 20\n }\n \n \
\ response = requests.get(f\"{base_url}/files\", headers=headers, params=params)\n\
\ response.raise_for_status()\n \n all_items\
\ = response.json().get(\"files\", [])\n all_files = []\n \
\ \n # Look for actual files, and also search inside sub-folders\n\
\ for item in all_items:\n print(f\"\U0001F50D DEBUG:\
\ Found item: {item['name']} (MIME: {item['mimeType']})\")\n \
\ \n if item['mimeType'] == 'application/vnd.google-apps.folder':\n\
\ # This is a folder, search inside it for actual files\n\
\ print(f\"\U0001F4C1 DEBUG: Searching inside folder: {item['name']}\"\
)\n sub_query = f\"'{item['id']}' in parents and trashed=false\"\
\n sub_params = {\n \"q\": sub_query,\
\ \n \"fields\": \"files(id, name, mimeType, size)\"\
, \n \"pageSize\": 20\n }\n \
\ \n sub_response = requests.get(f\"{base_url}/files\"\
, headers=headers, params=sub_params)\n sub_response.raise_for_status()\n\
\ sub_files = sub_response.json().get(\"files\", [])\n \
\ \n print(f\"\U0001F4C4 DEBUG: Found {len(sub_files)}\
\ items inside {item['name']}\")\n all_files.extend(sub_files)\n\
\ else:\n # This is a direct file\n \
\ all_files.append(item)\n \n # Filter for\
\ PDF files and relevant documents \n print(f\"\U0001F4CA DEBUG:\
\ Total files found after folder exploration: {len(all_files)}\")\n \
\ pdf_files = []\n for file in all_files:\n print(f\"\
\U0001F50D DEBUG: Checking file: {file['name']} (MIME: {file['mimeType']})\"\
)\n \n # Only include actual PDF files or Google\
\ Apps documents (but NOT folders)\n if (file[\"name\"].lower().endswith(\"\
.pdf\") or \n file[\"mimeType\"] == \"application/pdf\"):\n\
\ pdf_files.append({\n \"id\": file[\"\
id\"],\n \"name\": file[\"name\"],\n \
\ \"mimeType\": file[\"mimeType\"]\n })\n \
\ print(f\"\u2705 DEBUG: Added PDF file: {file['name']}\")\n \
\ elif (\"application/vnd.google-apps\" in file[\"mimeType\"] and\
\ \n file[\"mimeType\"] != \"application/vnd.google-apps.folder\"\
):\n # Include Google Apps documents but NOT folders\n \
\ pdf_files.append({\n \"id\": file[\"\
id\"],\n \"name\": file[\"name\"],\n \
\ \"mimeType\": file[\"mimeType\"]\n })\n \
\ print(f\"\U0001F4C4 DEBUG: Added Google Apps file: {file['name']}\
\ ({file['mimeType']})\")\n else:\n print(f\"\
\u23ED\uFE0F DEBUG: Skipped file: {file['name']} (unsupported type: {file['mimeType']})\"\
)\n \n # Limit to first 3 files to avoid size limits\n\
\ pdf_files = pdf_files[:3]\n \n print(f\"\U0001F50D\
\ DEBUG: Found {len(all_files)} total files, {len(pdf_files)} PDF files in {fy_name}\"\
)\n \n if not pdf_files:\n print(f\"\u26A0\
\uFE0F DEBUG: No PDF files found in {company_name} - {fy_name}\")\n \
\ logger.warning(f\"No PDF files found in {company_name} - {fy_name}\"\
)\n continue\n \n # Upload files to Gemini\n\
\ uploaded_files = []\n \n print(f\"\U0001F50D\
\ DEBUG: Starting file upload for {len(pdf_files)} files\")\n \n\
\ for pdf_file in pdf_files:\n try:\n \
\ file_id = pdf_file[\"id\"]\n file_name = pdf_file[\"\
name\"]\n \n print(f\"\U0001F50D DEBUG:\
\ Processing file: {file_name} (ID: {file_id})\")\n \n \
\ # Download file content (limited size)\n \
\ print(f\"\U0001F50D DEBUG: File MIME type: {pdf_file['mimeType']}\")\n\
\ \n if \"application/vnd.google-apps\"\
\ in pdf_file[\"mimeType\"]:\n # For Google Apps files,\
\ try to export as PDF\n if \"spreadsheet\" in pdf_file[\"\
mimeType\"]:\n export_mime = \"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet\"\
\n elif \"document\" in pdf_file[\"mimeType\"]:\n \
\ export_mime = \"application/pdf\"\n \
\ elif \"presentation\" in pdf_file[\"mimeType\"]:\n \
\ export_mime = \"application/pdf\"\n else:\n\
\ export_mime = \"application/pdf\"\n \
\ \n download_url = f\"{base_url}/files/{file_id}/export\"\
\n params = {\"mimeType\": export_mime}\n \
\ print(f\"\U0001F50D DEBUG: Using export endpoint with MIME: {export_mime}\"\
)\n else:\n # For regular files, use\
\ direct download\n download_url = f\"{base_url}/files/{file_id}\"\
\n params = {\"alt\": \"media\"}\n \
\ print(f\"\U0001F50D DEBUG: Using direct download endpoint\")\n \
\ \n print(f\"\U0001F50D DEBUG: Downloading\
\ file from: {download_url}\")\n \n try:\n\
\ response = requests.get(download_url, headers=headers,\
\ params=params, \n stream=True,\
\ timeout=30)\n response.raise_for_status()\n \
\ print(f\"\U0001F50D DEBUG: Download successful, status: {response.status_code}\"\
)\n except requests.exceptions.HTTPError as e:\n \
\ if response.status_code == 403:\n \
\ print(f\"\u26A0\uFE0F DEBUG: 403 Forbidden - trying alternative download\
\ method\")\n # For Google Apps files that fail export,\
\ skip them for now\n if \"application/vnd.google-apps\"\
\ in pdf_file[\"mimeType\"]:\n print(f\"\u26A0\
\uFE0F DEBUG: Skipping Google Apps file due to export permissions: {file_name}\"\
)\n continue\n else:\n\
\ raise e\n else:\n \
\ raise e\n \n \
\ # Download complete file content (no size limit for testing)\n \
\ content = response.content # Download complete file\n \
\ print(f\"\U0001F50D DEBUG: Downloaded complete file: {len(content)}\
\ bytes\")\n \n if len(content) == 0:\n\
\ print(f\"\u26A0\uFE0F DEBUG: File {file_name} has\
\ no content, skipping\")\n continue\n \
\ \n # Validate PDF format\n if not\
\ content.startswith(b'%PDF-'):\n print(f\"\u26A0\uFE0F\
\ DEBUG: File {file_name} doesn't appear to be a valid PDF, skipping\")\n \
\ continue\n \n \
\ print(f\"\u2705 DEBUG: PDF validation passed for {file_name}\")\n \
\ \n print(f\"\U0001F50D DEBUG: Content size:\
\ {len(content)} bytes, uploading to Gemini...\")\n # Upload\
\ to Gemini with explicit MIME type\n try:\n \
\ file = client.files.upload(\n file=io.BytesIO(content),\n\
\ config={\n \"display_name\"\
: file_name,\n \"mime_type\": \"application/pdf\"\
\ # Explicitly set PDF MIME type\n }\n \
\ )\n print(f\"\u2705 DEBUG: File uploaded\
\ to Gemini: {file.name}\")\n \n \
\ # Wait for Gemini to process the complete file\n \
\ time.sleep(10) # Increased wait time for larger files\n \
\ \n # Verify file was processed correctly\n \
\ uploaded_file_info = client.files.get(name=file.name)\n\
\ print(f\"\U0001F50D DEBUG: Gemini file state: {uploaded_file_info.state}\"\
)\n print(f\"\U0001F50D DEBUG: Gemini detected MIME:\
\ {uploaded_file_info.mime_type}\")\n \n \
\ # Check if file is ready for processing\n \
\ if uploaded_file_info.state.name != \"ACTIVE\":\n \
\ print(f\"\u26A0\uFE0F DEBUG: File {file_name} not in ACTIVE state: {uploaded_file_info.state.name}\"\
)\n time.sleep(2) # Additional wait\n \
\ uploaded_file_info = client.files.get(name=file.name)\n \
\ print(f\"\U0001F50D DEBUG: File state after wait:\
\ {uploaded_file_info.state.name}\")\n \n \
\ except Exception as upload_error:\n print(f\"\
\u274C DEBUG: Gemini upload failed: {str(upload_error)}\")\n \
\ continue\n \n uploaded_files.append({\n\
\ \"name\": file.name,\n \"display_name\"\
: file_name,\n \"size\": len(content)\n \
\ })\n \n logger.info(f\"Uploaded\
\ {file_name} to Gemini ({len(content)} bytes)\")\n time.sleep(1)\
\ # Rate limiting\n \n except Exception as\
\ e:\n print(f\"\u274C DEBUG: Error uploading {pdf_file['name']}:\
\ {str(e)}\")\n \n # Check if it's a permission\
\ issue with Google Apps files\n if \"403\" in str(e) and\
\ \"application/vnd.google-apps\" in pdf_file[\"mimeType\"]:\n \
\ print(f\"\u26A0\uFE0F DEBUG: Google Apps file export requires additional\
\ permissions\")\n print(f\"\u26A0\uFE0F DEBUG: File\
\ '{pdf_file['name']}' skipped due to insufficient export permissions\")\n \
\ \n logger.error(f\"Error uploading {pdf_file['name']}:\
\ {str(e)}\")\n continue\n \n if not\
\ uploaded_files:\n print(f\"\u26A0\uFE0F DEBUG: No files uploaded\
\ successfully for {company_name} - {fy_name}\")\n print(f\"\U0001F4CB\
\ DEBUG: Summary - Found {len(pdf_files)} files, but 0 uploaded due to permissions\"\
)\n logger.warning(f\"No files uploaded successfully for {company_name}\
\ - {fy_name}\")\n continue\n \n print(f\"\
\u2705 DEBUG: Successfully uploaded {len(uploaded_files)} files for {company_name}\
\ - {fy_name}\")\n \n # Get file references for Gemini\n\
\ print(f\"\U0001F517 DEBUG: Getting Gemini file references...\"\
)\n files = []\n for file_info in uploaded_files:\n \
\ gemini_file = client.files.get(name=file_info[\"name\"])\n \
\ files.append(gemini_file)\n print(f\"\U0001F517\
\ DEBUG: Got reference for: {file_info['display_name']} -> {gemini_file.name}\"\
)\n \n print(f\"\U0001F680 DEBUG: Ready to process {len(files)}\
\ files with Gemini AI\")\n \n # Wait additional time\
\ for all files to be fully processed by Gemini\n print(f\"\u23F3\
\ DEBUG: Waiting for all files to be fully processed by Gemini...\")\n \
\ time.sleep(15) # Increased wait time for complete PDF processing\n\
\ \n # Verify all files are ready for processing\n \
\ print(f\"\U0001F50D DEBUG: Verifying file readiness...\")\n \
\ for file in files:\n file_info = client.files.get(name=file.name)\n\
\ print(f\"\U0001F4C4 DEBUG: File {file.name} state: {file_info.state.name}\"\
)\n if file_info.state.name != \"ACTIVE\":\n \
\ print(f\"\u26A0\uFE0F DEBUG: File {file.name} not ready, waiting additional\
\ time...\")\n time.sleep(5)\n \n #\
\ Extract Balance Sheet data\n print(f\"\U0001F916 DEBUG: Starting\
\ Gemini extraction for Balance Sheet data...\")\n balance_sheet_prompt\
\ = f\"\"\"\n Extract Balance Sheet data for {company_name} - {fy_name}.\
\ Return JSON:\n {{\n \"assets\": {{\"total_assets\"\
: null, \"current_assets\": null, \"fixed_assets\": null}},\n \"\
liabilities\": {{\"total_liabilities\": null, \"current_liabilities\": null}},\n\
\ \"equity\": {{\"shareholders_funds\": null}},\n \
\ \"financial_year\": \"{fy_name}\"\n }}\n All figures\
\ in Rs. Million. Use null for missing values.\n \"\"\"\n \
\ \n print(f\"\U0001F916 DEBUG: Sending Balance Sheet prompt\
\ to Gemini...\")\n balance_response = client.models.generate_content(\n\
\ model=model_id,\n contents=[balance_sheet_prompt]\
\ + files,\n config={\"response_mime_type\": \"application/json\"\
}\n )\n \n print(f\"\U0001F916 DEBUG: Gemini\
\ Balance Sheet response: {balance_response.text[:200]}...\")\n balance_sheet\
\ = json.loads(balance_response.text)\n print(f\"\u2705 DEBUG: Balance\
\ Sheet data extracted successfully\")\n \n # Extract\
\ P&L data\n print(f\"\U0001F916 DEBUG: Starting Gemini extraction\
\ for P&L data...\")\n pl_prompt = f\"\"\"\n Extract P&L\
\ data for {company_name} - {fy_name}. Return JSON:\n {{\n \
\ \"revenue\": {{\"net_revenue\": null, \"revenue_growth\": null}},\n\
\ \"profitability\": {{\"ebitda\": null, \"net_profit\": null,\
\ \"ebitda_margin\": null}},\n \"financial_year\": \"{fy_name}\"\
\n }}\n All figures in Rs. Million. Use null for missing\
\ values.\n \"\"\"\n \n print(f\"\U0001F916\
\ DEBUG: Sending P&L prompt to Gemini...\")\n pl_response = client.models.generate_content(\n\
\ model=model_id,\n contents=[pl_prompt] + files,\n\
\ config={\"response_mime_type\": \"application/json\"}\n \
\ )\n \n print(f\"\U0001F916 DEBUG: Gemini P&L\
\ response: {pl_response.text[:200]}...\")\n pl_data = json.loads(pl_response.text)\n\
\ print(f\"\u2705 DEBUG: P&L data extracted successfully\")\n \
\ \n # Compile extraction results\n extraction_result\
\ = {\n \"company_name\": company_name,\n \"fy_name\"\
: fy_name,\n \"files_processed\": len(uploaded_files),\n \
\ \"balance_sheet\": balance_sheet,\n \"profit_loss\"\
: pl_data,\n \"extraction_timestamp\": datetime.now().isoformat(),\n\
\ \"status\": \"success\"\n }\n \n \
\ print(f\"\U0001F4CA DEBUG: Extraction result compiled:\")\n \
\ print(f\" - Company: {company_name}\")\n print(f\" - FY:\
\ {fy_name}\")\n print(f\" - Files processed: {len(uploaded_files)}\"\
)\n print(f\" - Balance Sheet extracted: {bool(balance_sheet)}\"\
)\n print(f\" - P&L extracted: {bool(pl_data)}\")\n \
\ \n company_extractions.append(extraction_result)\n logger.info(f\"\
Successfully extracted data for {company_name} - {fy_name}\")\n print(f\"\
\u2705 DEBUG: Added extraction result to company_extractions ({len(company_extractions)}\
\ total)\")\n \n except Exception as e:\n print(f\"\
\u274C DEBUG: Error processing {company_name} - {fy_name}: {str(e)}\")\n \
\ logger.error(f\"Error processing {company_name} - {fy_name}: {str(e)}\"\
)\n continue\n \n # Update loop state with results\n current_processed\
\ = loop_state.get('processed_companies', [])\n current_processed.append({\n\
\ \"company_name\": company_name,\n \"company_id\": company_id,\n\
\ \"fy_extractions\": company_extractions,\n \"total_fy_processed\"\
: len(company_extractions)\n })\n \n current_total = loop_state.get('total_extractions',\
\ 0)\n current_successful = loop_state.get('successful_extractions', 0)\n\
\ \n # Initialize state updates dictionary\n state_updates = {}\n \
\ \n # Update state variables\n state_updates['processed_companies']\
\ = current_processed\n state_updates['total_extractions'] = current_total\
\ + len(company_extractions)\n state_updates['successful_extractions'] =\
\ current_successful + len(company_extractions)\n \n print(f\"\U0001F50D\
\ DEBUG: Final results for {company_name}: {len(company_extractions)} extractions\
\ completed\")\n \n # Create Excel file if we have extractions\n excel_upload_result\
\ = None\n if company_extractions:\n try:\n print(f\"\U0001F4CA\
\ DEBUG: Creating Excel workbook for {company_name}\")\n \n \
\ # Import required modules for Excel creation\n from openpyxl\
\ import Workbook\n from openpyxl.styles import PatternFill, Font,\
\ Alignment, Border, Side\n from openpyxl.utils import get_column_letter\n\
\ import pandas as pd\n \n # Create Excel workbook\n\
\ wb = Workbook()\n wb.remove(wb.active) # Remove default\
\ sheet\n \n # Define styles\n header_fill\
\ = PatternFill(start_color=\"366092\", end_color=\"366092\", fill_type=\"solid\"\
)\n header_font = Font(bold=True, color=\"FFFFFF\", size=11)\n \
\ metric_fill = PatternFill(start_color=\"DCE6F1\", end_color=\"DCE6F1\"\
, fill_type=\"solid\")\n metric_font = Font(bold=True, size=10)\n\
\ data_font = Font(size=10)\n center_align = Alignment(horizontal=\"\
center\", vertical=\"center\")\n left_align = Alignment(horizontal=\"\
left\", vertical=\"center\")\n right_align = Alignment(horizontal=\"\
right\", vertical=\"center\")\n \n thin_border = Border(\n\
\ left=Side(style=\"thin\"), right=Side(style=\"thin\"),\n \
\ top=Side(style=\"thin\"), bottom=Side(style=\"thin\")\n \
\ )\n \n # Helper function to flatten nested dictionaries\
\ (mimicking pl_workflow)\n def flatten_dict(d, parent_key=\"\",\
\ sep=\"_\"):\n items = []\n for k, v in d.items():\n\
\ new_key = f\"{parent_key}{sep}{k}\" if parent_key else\
\ k\n if isinstance(v, dict):\n items.extend(flatten_dict(v,\
\ new_key, sep=sep).items())\n else:\n \
\ items.append((new_key, v))\n return dict(items)\n \
\ \n # Get all unique section names from extractions (mimicking\
\ pl_workflow approach)\n all_sections = set()\n for fy_data\
\ in company_extractions:\n # Our data structure has balance_sheet\
\ and profit_loss directly in fy_data\n for key in fy_data.keys():\n\
\ if key not in [\"company_name\", \"fy_name\", \"files_processed\"\
, \"extraction_timestamp\", \"status\"]:\n all_sections.add(key)\n\
\ \n print(f\"\U0001F50D DEBUG: Found sections to export:\
\ {list(all_sections)}\")\n \n # Process each section\
\ found in extractions\n for section_name in sorted(all_sections):\n\
\ ws = wb.create_sheet(title=section_name[:31])\n \
\ \n # Collect data from all FYs for this section\n \
\ section_data = {}\n financial_years = []\n \
\ \n for fy_data in company_extractions:\n \
\ fy_name = fy_data.get(\"fy_name\", \"Unknown\")\n \
\ \n if section_name in fy_data:\n \
\ raw_data = fy_data[section_name]\n \n \
\ # Convert to nested structure like pl_workflow\n \
\ nested_data = {\n \"Financial_Year\"\
: fy_name,\n \"note\": \"All figures in Rs. Million\
\ unless otherwise stated\"\n }\n \
\ \n # Add the actual extracted data\n \
\ if isinstance(raw_data, dict):\n nested_data.update(raw_data)\n\
\ else:\n nested_data[\"extracted_data\"\
] = raw_data\n \n # Flatten the\
\ nested structure (mimicking pl_workflow)\n flattened\
\ = flatten_dict(nested_data)\n \n \
\ # Remove Financial_Year to avoid duplicates (like pl_workflow)\n \
\ if \"Financial_Year\" in flattened:\n \
\ del flattened[\"Financial_Year\"]\n \n \
\ section_data[fy_name] = flattened\n \
\ financial_years.append(fy_name)\n else:\n \
\ # Create empty entry if section not found\n \
\ section_data[fy_name] = {\"No_Data\": \"Available\"}\n \
\ financial_years.append(fy_name)\n \n \
\ if section_data and financial_years:\n # Create DataFrame\
\ with financial years as columns\n df = pd.DataFrame.from_dict(section_data,\
\ orient=\"columns\")\n df = df.reindex(sorted(df.columns),\
\ axis=1)\n df.reset_index(inplace=True)\n \
\ df.rename(columns={\"index\": \"Metrics\"}, inplace=True)\n \
\ df[\"Metrics\"] = df[\"Metrics\"].str.replace(\"_\", \" \").str.title()\n\
\ \n # Write headers\n \
\ headers = [\"Metrics\"] + list(df.columns[1:])\n ws.append(headers)\n\
\ \n # Apply header formatting\n \
\ for col_idx, header in enumerate(headers, 1):\n \
\ cell = ws.cell(row=1, column=col_idx)\n \
\ cell.fill = header_fill\n cell.font = header_font\n\
\ cell.alignment = center_align\n \
\ cell.border = thin_border\n \n #\
\ Write data rows\n for idx, row in df.iterrows():\n \
\ ws.append(row.tolist())\n \n \
\ metric_cell = ws.cell(row=idx + 2, column=1)\n \
\ metric_cell.fill = metric_fill\n metric_cell.font\
\ = metric_font\n metric_cell.alignment = left_align\n\
\ metric_cell.border = thin_border\n \
\ \n for col_idx in range(2, len(headers) + 1):\n\
\ data_cell = ws.cell(row=idx + 2, column=col_idx)\n\
\ data_cell.font = data_font\n \
\ data_cell.alignment = right_align\n data_cell.border\
\ = thin_border\n \n # Adjust column widths\n\
\ ws.column_dimensions[\"A\"].width = 40\n \
\ for col_idx in range(2, len(headers) + 1):\n ws.column_dimensions[get_column_letter(col_idx)].width\
\ = 15\n \n # Freeze panes\n \
\ ws.freeze_panes = \"B2\"\n \n \
\ # Add sheet title\n ws.insert_rows(1)\n \
\ ws.merge_cells(f\"A1:{get_column_letter(len(headers))}1\")\n \
\ title_cell = ws[\"A1\"]\n title_cell.value\
\ = f\"{section_name} Analysis\"\n title_cell.font = Font(bold=True,\
\ size=14, color=\"FFFFFF\")\n title_cell.fill = PatternFill(start_color=\"\
1F4788\", end_color=\"1F4788\", fill_type=\"solid\")\n title_cell.alignment\
\ = center_align\n \n print(f\"\u2705\
\ DEBUG: Created sheet: {section_name} with {len(df)} metrics\")\n \
\ else:\n # Create empty sheet\n \
\ ws.append([\"Metrics\", \"No Data Available\"])\n ws.column_dimensions[\"\
A\"].width = 40\n ws.column_dimensions[\"B\"].width = 20\n\
\ print(f\"\u26A0\uFE0F DEBUG: No data available for section:\
\ {section_name}\")\n \n # Save to bytes\n \
\ excel_buffer = io.BytesIO()\n wb.save(excel_buffer)\n \
\ excel_buffer.seek(0)\n excel_content = excel_buffer.getvalue()\n\
\ \n # Upload to Google Drive in the company folder\n\
\ excel_filename = f\"{company_name}_Financial_Analysis.xlsx\"\n\
\ print(f\"\U0001F4E4 DEBUG: Uploading {excel_filename} to Google\
\ Drive...\")\n \n # Get fresh access token for upload\n\
\ nango_key = \"${nango_key}\"\n nango_connection_id =\
\ \"${nango_connection_id}\"\n \n print(f\"\U0001F50D\
\ DEBUG: nango_key = {nango_key}\")\n print(f\"\U0001F50D DEBUG:\
\ nango_connection_id = {nango_connection_id}\")\n auth_url = f\"\
https://auth-dev.assistents.ai/connection/{nango_connection_id}?provider_config_key=google-drive-hq3h\"\
\n print(f\"\U0001F50D DEBUG: Request URL = {auth_url}\")\n \
\ \n nango_response = requests.get(\n auth_url,\n\
\ headers={\"Authorization\": f\"Bearer {nango_key}\", \"Content-Type\"\
: \"application/json\"}\n )\n \n print(f\"\U0001F50D\
\ DEBUG: Response status code = {nango_response.status_code}\")\n \
\ print(f\"\U0001F50D DEBUG: Response text = {nango_response.text}\")\n \
\ \n if nango_response.status_code == 200:\n \
\ fresh_access_token = nango_response.json()[\"credentials\"][\"access_token\"\
]\n \n # Upload metadata \n metadata\
\ = {\"name\": excel_filename, \"parents\": [company_id]}\n \n\
\ files = {\n \"data\": (\"metadata\", json.dumps(metadata),\
\ \"application/json; charset=UTF-8\"),\n \"file\": (excel_filename,\
\ excel_content, \"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet\"\
)\n }\n \n headers = {\"Authorization\"\
: f\"Bearer {fresh_access_token}\"}\n \n upload_response\
\ = requests.post(\n \"https://www.googleapis.com/upload/drive/v3/files?uploadType=multipart\"\
,\n headers=headers,\n files=files\n \
\ )\n \n if upload_response.status_code\
\ == 200:\n upload_result = upload_response.json()\n \
\ excel_upload_result = {\n \"filename\"\
: excel_filename,\n \"file_id\": upload_result.get('id'),\n\
\ \"status\": \"success\"\n }\n \
\ print(f\"\u2705 DEBUG: Successfully uploaded {excel_filename}\"\
)\n print(f\"\U0001F4CB DEBUG: File ID: {upload_result.get('id')}\"\
)\n else:\n excel_upload_result = {\n \
\ \"filename\": excel_filename,\n \
\ \"error\": upload_response.text,\n \"status\": \"failed\"\
\n }\n print(f\"\u274C DEBUG: Failed to\
\ upload {excel_filename}: {upload_response.text}\")\n else:\n \
\ excel_upload_result = {\n \"error\": \"Failed\
\ to get fresh access token\",\n \"status\": \"failed\"\n\
\ }\n print(f\"\u274C DEBUG: Failed to get fresh\
\ access token: {nango_response.text}\")\n \n except Exception\
\ as excel_error:\n excel_upload_result = {\n \"error\"\
: str(excel_error),\n \"status\": \"failed\"\n }\n\
\ print(f\"\u274C DEBUG: Error creating/uploading Excel: {str(excel_error)}\"\
)\n \n outputs = {\n \"company_name\": company_name,\n \"\
extractions_completed\": len(company_extractions),\n \"excel_upload\"\
: excel_upload_result,\n \"status\": \"success\"\n }\n \n logger.info(f\"\
Completed processing {company_name}: {len(company_extractions)} extractions\"\
)\n print(f\"__OUTPUTS__ {json.dumps(outputs)}\")\n print(f\"__STATE_UPDATES__\
\ {json.dumps(state_updates)}\")\n \nexcept Exception as e:\n logger.error(f\"\
Error processing company FY folders: {str(e)}\")\n outputs = {\"status\"\
: \"failed\", \"error\": str(e)}\n print(f\"__OUTPUTS__ {json.dumps(outputs)}\"\
)\n"
depends_on:
- process_current_company
- list_company_folders
requirements:
- requests>=2.28.0
- google-genai>=0.7.0
- openpyxl>=3.0.0
- pandas>=1.5.0
timeout_seconds: 300
description: Process multiple companies using loop with iteration limit
previous_node: list_company_folders
max_iterations: 1
state_variables:
total_extractions: 0
processed_companies: []
successful_extractions: 0
iteration_variable: company_index
- id: final_report
name: Generate Final Workflow Report
type: script
script: "import json\nimport logging\nfrom datetime import datetime\n\nlogger =\
\ logging.getLogger(__name__)\n\ntry:\n # Access loop final state results\n\
\ loop_results = ${process_companies_loop}\n company_folders = ${list_company_folders}\n\
\ \n current_time = datetime.now().isoformat()\n \n # Extract results\
\ from loop final state\n loop_final_state = loop_results.get(\"final_state\"\
, {})\n processed_companies = loop_final_state.get(\"processed_companies\"\
, [])\n total_extractions = loop_final_state.get(\"total_extractions\", 0)\n\
\ successful_extractions = loop_final_state.get(\"successful_extractions\"\
, 0)\n iterations_completed = loop_results.get(\"iterations_completed\", 0)\n\
\ \n # Compile final comprehensive report\n final_report = {\n \
\ \"workflow_execution\": {\n \"execution_id\": \"${execution_id}\"\
,\n \"workflow_id\": \"${workflow_id}\",\n \"completed_at\"\
: current_time,\n \"status\": \"completed\"\n },\n \"\
processing_summary\": {\n \"total_companies_found\": company_folders.get(\"\
total_companies\", 0),\n \"companies_processed\": iterations_completed,\n\
\ \"max_companies_limit\": 3,\n \"total_extractions_attempted\"\
: total_extractions,\n \"successful_extractions\": successful_extractions,\n\
\ \"success_rate\": f\"{(successful_extractions/total_extractions*100):.1f}%\"\
\ if total_extractions > 0 else \"0%\"\n },\n \"extraction_summary\"\
: {\n \"sections_per_extraction\": [\"balance_sheet\", \"profit_loss\"\
],\n \"total_sections_extracted\": successful_extractions * 2,\n \
\ \"ai_model_used\": \"gemini-2.5-flash\",\n \"files_per_fy_limit\"\
: 3,\n \"file_size_limit_per_file\": \"400KB\"\n },\n \
\ \"company_results\": processed_companies,\n \"technical_implementation\"\
: {\n \"workflow_architecture\": \"nested_loops\",\n \"\
loop_structure\": {\n \"companies_loop\": {\n \
\ \"max_iterations\": 3,\n \"actual_iterations\": iterations_completed\n\
\ },\n \"fy_folders_loop\": {\n \
\ \"max_iterations_per_company\": 2,\n \"processes_balance_sheet_and_pl\"\
: True\n }\n },\n \"processing_constraints\"\
: {\n \"task_timeout_limit\": \"5 minutes per task\",\n \
\ \"data_transfer_limit\": \"1MB between tasks\",\n \"\
file_download_limit\": \"400KB per file\",\n \"concurrent_file_processing\"\
: \"3 files per FY folder\"\n },\n \"optimizations_applied\"\
: [\n \"Nested loop architecture for scalable processing\",\n \
\ \"File size limiting to respect memory constraints\", \n \
\ \"Streaming file downloads with chunk processing\",\n \
\ \"Consolidated extraction within single task to minimize transfers\",\n \
\ \"State variable tracking across loop iterations\"\n \
\ ]\n },\n \"performance_metrics\": {\n \"loop_iterations_completed\"\
: iterations_completed,\n \"average_files_per_company\": total_extractions\
\ / successful_extractions if successful_extractions > 0 else 0,\n \
\ \"processing_efficiency\": \"High - All extractions within single loop task\"\
,\n \"memory_usage\": \"Optimized - No large data retention between\
\ tasks\"\n }\n }\n \n outputs = final_report\n \n logger.info(f\"\
Final workflow report generated successfully\")\n logger.info(f\"Processed\
\ {iterations_completed} companies with {successful_extractions} successful extractions\"\
)\n print(f\"__OUTPUTS__ {json.dumps(outputs)}\")\n \nexcept Exception as\
\ e:\n logger.error(f\"Error generating final report: {str(e)}\")\n outputs\
\ = {\n \"workflow_execution\": {\n \"status\": \"failed\",\n\
\ \"error\": str(e),\n \"completed_at\": datetime.now().isoformat()\n\
\ },\n \"processing_summary\": {\n \"companies_processed\"\
: 0,\n \"successful_extractions\": 0,\n \"error_details\"\
: str(e)\n }\n }\n print(f\"__OUTPUTS__ {json.dumps(outputs)}\")"
depends_on:
- process_companies_loop
description: Generate comprehensive report from all loop processing results
previous_node: process_companies_loop
timeout_seconds: 180
inputs:
- name: folder_id
type: string
default: 1W22-59ESyR-E_1PMVWevzL-WvlFALDl-
required: true
description: Google Drive folder ID containing company folders
- name: gemini_api_key
type: string
default: AIzaSyB0_e6aU4gF-qRapMm3UYBSITpbd0ehsYk
required: true
description: Gemini API key for AI processing
- name: nango_connection_id
type: string
default: 4274993f-c614-4efa-a01e-8d07422f4b09
required: true
description: Nango connection ID for Google Drive access
- name: nango_key
type: string
default: 8df3e2de-2307-48d3-94bd-ddd3fd6a62ec
required: true
description: Nango authentication key
outputs:
company_results:
type: object
source: final_report.company_results
description: Detailed financial data for all processed companies
extraction_summary:
type: object
source: final_report.extraction_summary
description: Technical details about extraction process
processing_summary:
type: object
source: final_report.processing_summary
description: Summary of companies processed and extraction success rates
workflow_execution:
type: object
source: final_report.workflow_execution
description: Workflow execution metadata and status
version: '1.0'
description: Extract financial data from PDFs in Google Drive using Gemini AI
timeout_seconds: 7200
```
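
For reference, the Excel workbook assembled inside process_fy_folders_current_company is written back to the company's Drive folder with a multipart upload. A minimal sketch of that step, condensed from the script above (the access token is assumed to come from the Nango exchange shown earlier):

```python
import json
import requests

def upload_excel_to_drive(access_token: str, folder_id: str, filename: str, content: bytes) -> str:
    """Upload an .xlsx file into a Drive folder via multipart upload and return the new file id."""
    metadata = {"name": filename, "parents": [folder_id]}
    files = {
        "data": ("metadata", json.dumps(metadata), "application/json; charset=UTF-8"),
        "file": (
            filename,
            content,
            "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
        ),
    }
    response = requests.post(
        "https://www.googleapis.com/upload/drive/v3/files?uploadType=multipart",
        headers={"Authorization": f"Bearer {access_token}"},
        files=files,
    )
    response.raise_for_status()
    return response.json()["id"]
```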