Financial Data Extraction and Google Drive Upload

Automated workflow to scrape financial reports and concall transcripts from screener.in and upload to Google Drive with organized folder structure

Back

Workflow Information

ID: financial_data_extraction_workflow

Namespace: finance

Version: 1.0

Created: 2025-07-29

Updated: 2025-07-29

Tasks: 5

Quick Actions

Manage Secrets

Inputs

Name	Type	Required	Default
`company_code`	string	Optional	`ULTRACEMCO`
`financial_years`	array	Optional	`['all']`
`include_concalls`	boolean	Optional	`True`
`nango_connection_id`	string	Optional	`e233fe88-9ee3-48b7-93a5-17a21091e79f`
`nango_key`	string	Optional	`8df3e2de-2307-48d3-94bd-ddd3fd6a62ec`
`main_folder_id`	string	Optional	`1W22-59ESyR-E_1PMVWevzL-WvlFALDl-`

Outputs

Name	Type	Source
`extraction_summary`	object	generate_extraction_log.summary
`extraction_log`	object	generate_extraction_log.extraction_log
`company_drive_folder_id`	string	initialize_drive_handler.company_folder_id
`annual_reports_processed`	integer	process_annual_reports.final_state.processed_count
`concall_transcripts_processed`	integer	process_concall_transcripts.final_state.processed_count

Tasks

initialize_drive_handler

script

Initialize Google Drive handler and create company folder

Python Script:

import json
import requests
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class GoogleDriveHandler:
    def __init__(self, nango_connection_id, nango_key):
        self.access_token = self.get_nango_access_token(nango_connection_id, nango_key)
        self.base_url = "https://www.googleapis.com/drive/v3"
        self.headers = {
            'Authorization': f'Bearer {self.access_token}',
            'Content-Type': 'application/json'
        }
    
    def get_nango_access_token(self, connection_id, nango_key):
        url = f"https://auth-dev.assistents.ai/connection/{connection_id}?provider_config_key=google-drive-hq3h"
        headers = {
            'Authorization': f'Bearer {nango_key}',
            'Content-Type': 'application/json'
        }
        
        response = requests.get(url, headers=headers)
        response.raise_for_status()
        
        data = response.json()
        access_token = data['credentials']['access_token']
        logger.info("Successfully retrieved access token from Nango")
        return access_token
    
    def create_folder(self, folder_name, parent_folder_id):
        metadata = {
            'name': folder_name,
            'mimeType': 'application/vnd.google-apps.folder',
            'parents': [parent_folder_id]
        }
        
        response = requests.post(
            f"{self.base_url}/files",
            headers=self.headers,
            json=metadata
        )
        response.raise_for_status()
        
        folder_data = response.json()
        logger.info(f"Created folder: {folder_name} (ID: {folder_data['id']})")
        return folder_data
    
    def check_folder_exists(self, folder_name, parent_folder_id):
        query = f"name='{folder_name}' and mimeType='application/vnd.google-apps.folder' and '{parent_folder_id}' in parents and trashed=false"
        params = {
            'q': query,
            'fields': 'files(id, name)',
            'pageSize': 100
        }
        
        response = requests.get(f"{self.base_url}/files", headers=self.headers, params=params)
        response.raise_for_status()
        
        files = response.json().get('files', [])
        return files[0] if files else None
    
    def get_or_create_folder(self, folder_name, parent_folder_id):
        existing_folder = self.check_folder_exists(folder_name, parent_folder_id)
        if existing_folder:
            logger.info(f"Using existing folder: {folder_name} (ID: {existing_folder['id']})")
            return existing_folder
        else:
            return self.create_folder(folder_name, parent_folder_id)

# Initialize drive handler
nango_connection_id = "${nango_connection_id}"
nango_key = "${nango_key}"
main_folder_id = "${main_folder_id}"
company_code = "${company_code}"

try:
    drive_handler = GoogleDriveHandler(nango_connection_id, nango_key)
    
    # Create/get company folder
    company_folder = drive_handler.get_or_create_folder(company_code, main_folder_id)
    
    result = {
        "drive_handler_initialized": True,
        "company_folder_id": company_folder['id'],
        "company_folder_name": company_folder['name'],
        "access_token": drive_handler.access_token[:20] + "..." # Truncate for security
    }
    
    print(f"✓ Google Drive handler initialized successfully")
    print(f"✓ Company folder ready: {company_code} (ID: {company_folder['id']})")
    print(f"__OUTPUTS__ {json.dumps(result)}")
    
except Exception as e:
    error_result = {
        "drive_handler_initialized": False,
        "error": str(e)
    }
    print(f"✗ Failed to initialize Google Drive handler: {str(e)}")
    print(f"__OUTPUTS__ {json.dumps(error_result)}")
    raise

scrape_financial_data

script

Scrape annual reports and concall transcripts from screener.in

Dependencies: playwright

Python Script:

import json
import asyncio
import re
from playwright.async_api import async_playwright

def extract_financial_year(year_text):
    clean_text = year_text.replace("Financial Year ", "").strip()
    
    # Look for patterns like "2023-24" or "2023-2024"
    match = re.search(r'(\d{4})-(\d{2,4})', clean_text)
    if match:
        end_year = match.group(2)
        if len(end_year) == 2:
            return f"FY{end_year}"
        else:
            return f"FY{end_year[-2:]}"
    
    # Look for single year like "2024"
    match = re.search(r'(\d{4})', clean_text)
    if match:
        year = match.group(1)
        return f"FY{year[-2:]}"
    
    return f"FY{clean_text}"

async def scrape_data():
    company_code = "${company_code}"
    target_years = ${financial_years}
    include_concalls = ${include_concalls}
    
    url = f"https://www.screener.in/company/{company_code}/consolidated/"
    
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        )
        page = await context.new_page()
        
        try:
            print(f"Navigating to {url}")
            await page.goto(url, wait_until='networkidle')
            await page.wait_for_selector('.documents.annual-reports', timeout=10000)
            
            # Extract annual reports
            annual_reports = await page.evaluate('''
                () => {
                    const reportSection = document.querySelector('.documents.annual-reports');
                    if (!reportSection) return [];
                    
                    const links = reportSection.querySelectorAll('ul.list-links li a');
                    return Array.from(links).map(link => ({
                        year: link.textContent.trim().split('\\n')[0],
                        url: link.href,
                        source: link.querySelector('.ink-600.smaller')?.textContent.trim() || ''
                    }));
                }
            ''')
            
            # Extract concall transcripts if requested
            concall_transcripts = []
            if include_concalls:
                concall_transcripts = await page.evaluate('''
                    () => {
                        const concallSection = document.querySelector('.documents.concalls');
                        if (!concallSection) return [];
                        
                        const items = concallSection.querySelectorAll('ul.list-links li');
                        const transcripts = [];
                        
                        items.forEach(item => {
                            const dateDiv = item.querySelector('.ink-600.font-size-15.font-weight-500.nowrap');
                            const transcriptLink = item.querySelector('a.concall-link[title="Raw Transcript"]');
                            
                            if (dateDiv && transcriptLink) {
                                transcripts.push({
                                    date: dateDiv.textContent.trim(),
                                    url: transcriptLink.href,
                                    type: 'transcript'
                                });
                            }
                        });
                        
                        return transcripts;
                    }
                ''')
            
            return {
                "annual_reports": annual_reports,
                "concall_transcripts": concall_transcripts
            }
            
        finally:
            await browser.close()

# Run scraping
try:
    scraped_data = asyncio.run(scrape_data())
    
    # Filter by target years if specified
    target_years = ${financial_years}
    if target_years != ["all"]:
        filtered_reports = []
        for report in scraped_data["annual_reports"]:
            fy = extract_financial_year(report['year'])
            if fy in target_years:
                report['financial_year'] = fy
                filtered_reports.append(report)
        scraped_data["annual_reports"] = filtered_reports
        
        filtered_concalls = []
        for transcript in scraped_data["concall_transcripts"]:
            # Extract FY from concall date
            date_text = transcript['date']
            fy_match = re.search(r'FY(\d{2})', date_text)
            if fy_match:
                fy = f"FY{fy_match.group(1)}"
            else:
                year_match = re.search(r'(\d{4})', date_text)
                if year_match:
                    year = int(year_match.group(1))
                    fy = f"FY{str(year)[-2:]}"
                else:
                    fy = "FY_Unknown"
            
            if fy in target_years:
                transcript['financial_year'] = fy
                filtered_concalls.append(transcript)
        scraped_data["concall_transcripts"] = filtered_concalls
    else:
        # Add financial year to all items
        for report in scraped_data["annual_reports"]:
            report['financial_year'] = extract_financial_year(report['year'])
        
        for transcript in scraped_data["concall_transcripts"]:
            date_text = transcript['date']
            fy_match = re.search(r'FY(\d{2})', date_text)
            if fy_match:
                fy = f"FY{fy_match.group(1)}"
            else:
                year_match = re.search(r'(\d{4})', date_text)
                if year_match:
                    year = int(year_match.group(1))
                    fy = f"FY{str(year)[-2:]}"
                else:
                    fy = "FY_Unknown"
            transcript['financial_year'] = fy
    
    result = {
        "scraping_successful": True,
        "company_code": "${company_code}",
        "annual_reports_found": len(scraped_data["annual_reports"]),
        "concall_transcripts_found": len(scraped_data["concall_transcripts"]),
        "annual_reports": scraped_data["annual_reports"],
        "concall_transcripts": scraped_data["concall_transcripts"]
    }
    
    print(f"✓ Scraping completed for ${company_code}")
    print(f"✓ Found {len(scraped_data['annual_reports'])} annual reports")
    print(f"✓ Found {len(scraped_data['concall_transcripts'])} concall transcripts")
    print(f"__OUTPUTS__ {json.dumps(result)}")
    
except Exception as e:
    error_result = {
        "scraping_successful": False,
        "error": str(e)
    }
    print(f"✗ Scraping failed: {str(e)}")
    print(f"__OUTPUTS__ {json.dumps(error_result)}")
    raise

process_annual_reports

loop

Process all annual reports in a loop

Loop Configuration

Type: for
Max Iterations: ${scrape_financial_data.annual_reports_found}

Iterator Variable: report_index
State Variables: error_count, access_token, uploaded_count, processed_count, company_folder_id, processed_reports

Loop Flow (1 steps)

Download and Upload Annual Report script

import json
import requests
import re
import time
from datetime import datetime

# Get current report
annual_reports = ${scrape_financial_data.annual_reports}
report_index = inputs.get('report_index', 0)

if report_index >= len(annual_reports):
    print(f"Report index {report_index} out of range")
    outputs["status"] = "skipped"
    return

current_report = annual_reports[report_index]
company_code = "${company_code}"
company_folder_id = loop_state.get('company_folder_id')
access_token = loop_state.get('access_token')

print(f"Processing report {report_index + 1}/{len(annual_reports)}: {current_report['year']}")

try:
    # Setup Drive API headers
    headers = {'Authorization': f'Bearer {access_token}'}
    base_url = "https://www.googleapis.com/drive/v3"
    
    # Get/Create FY folder
    fy = current_report['financial_year']
    
    # Create FY folder
    fy_folder_metadata = {
        'name': fy,
        'mimeType': 'application/vnd.google-apps.folder',
        'parents': [company_folder_id]
    }
    
    # Check if FY folder exists
    query = f"name='{fy}' and mimeType='application/vnd.google-apps.folder' and '{company_folder_id}' in parents and trashed=false"
    params = {'q': query, 'fields': 'files(id, name)'}
    
    response = requests.get(f"{base_url}/files", headers={'Authorization': f'Bearer {access_token}'}, params=params)
    response.raise_for_status()
    
    existing_folders = response.json().get('files', [])
    if existing_folders:
        fy_folder_id = existing_folders[0]['id']
        print(f"Using existing FY folder: {fy}")
    else:
        response = requests.post(f"{base_url}/files", headers={'Authorization': f'Bearer {access_token}', 'Content-Type': 'application/json'}, json=fy_folder_metadata)
        response.raise_for_status()
        fy_folder_id = response.json()['id']
        print(f"Created FY folder: {fy}")
    
    # Create/Get "Yearly report" subfolder
    yearly_report_metadata = {
        'name': 'Yearly report',
        'mimeType': 'application/vnd.google-apps.folder',
        'parents': [fy_folder_id]
    }
    
    query = f"name='Yearly report' and mimeType='application/vnd.google-apps.folder' and '{fy_folder_id}' in parents and trashed=false"
    params = {'q': query, 'fields': 'files(id, name)'}
    
    response = requests.get(f"{base_url}/files", headers={'Authorization': f'Bearer {access_token}'}, params=params)
    response.raise_for_status()
    
    existing_yearly_folders = response.json().get('files', [])
    if existing_yearly_folders:
        yearly_folder_id = existing_yearly_folders[0]['id']
    else:
        response = requests.post(f"{base_url}/files", headers={'Authorization': f'Bearer {access_token}', 'Content-Type': 'application/json'}, json=yearly_report_metadata)
        response.raise_for_status()
        yearly_folder_id = response.json()['id']
    
    # Download and upload file
    filename = f"{company_code}_{fy}_Annual_Report.pdf"
    
    # Check if file already exists
    query = f"name='{filename}' and '{yearly_folder_id}' in parents and trashed=false"
    params = {'q': query, 'fields': 'files(id, name)'}
    
    response = requests.get(f"{base_url}/files", headers={'Authorization': f'Bearer {access_token}'}, params=params)
    response.raise_for_status()
    
    existing_files = response.json().get('files', [])
    if existing_files:
        print(f"✓ {filename} already exists in Drive, skipping...")
        upload_result = existing_files[0]
        status = "already_exists"
    else:
        print(f"Downloading {filename}...")
        
        # Download file
        download_headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
        
        download_response = requests.get(current_report['url'], headers=download_headers, timeout=120, stream=True)
        download_response.raise_for_status()
        
        file_content = b''
        for chunk in download_response.iter_content(chunk_size=8192):
            file_content += chunk
        
        file_size = len(file_content)
        print(f"Downloaded {filename} ({file_size:,} bytes)")
        
        # Upload to Drive
        upload_metadata = {
            'name': filename,
            'parents': [yearly_folder_id]
        }
        
        files = {
            'data': ('metadata', json.dumps(upload_metadata), 'application/json; charset=UTF-8'),
            'file': (filename, file_content, 'application/pdf')
        }
        
        upload_response = requests.post(
            'https://www.googleapis.com/upload/drive/v3/files?uploadType=multipart',
            headers={'Authorization': f'Bearer {access_token}'},
            files=files
        )
        upload_response.raise_for_status()
        
        upload_result = upload_response.json()
        print(f"✓ Uploaded {filename} to Drive (ID: {upload_result['id']})")
        status = "uploaded"
    
    # Update state
    processed_report = {
        "filename": filename,
        "financial_year": fy,
        "drive_file_id": upload_result['id'],
        "status": status,
        "processed_at": datetime.now().isoformat(),
        "source_url": current_report['url']
    }
    
    current_processed = loop_state.get('processed_reports', [])
    current_processed.append(processed_report)
    
    state_updates["processed_reports"] = current_processed
    state_updates["processed_count"] = loop_state.get('processed_count', 0) + 1
    
    if status == "uploaded":
        state_updates["uploaded_count"] = loop_state.get('uploaded_count', 0) + 1
    
    outputs["processed_report"] = processed_report
    outputs["status"] = "success"
    
    print(f"✓ Successfully processed {filename}")
    
except Exception as e:
    print(f"✗ Failed to process report {report_index + 1}: {str(e)}")
    
    state_updates["error_count"] = loop_state.get('error_count', 0) + 1
    state_updates["processed_count"] = loop_state.get('processed_count', 0) + 1
    
    outputs["error"] = str(e)
    outputs["status"] = "error"

# Small delay between processing
time.sleep(2)

process_concall_transcripts

loop

Process all concall transcripts in a loop

Loop Configuration

Type: for
Max Iterations: ${scrape_financial_data.concall_transcripts_found}

Iterator Variable: transcript_index
State Variables: error_count, access_token, uploaded_count, processed_count, company_folder_id, processed_transcripts

Loop Flow (1 steps)

Download and Upload Concall Transcript script

import json
import requests
import re
import time
from datetime import datetime

# Get current transcript
concall_transcripts = ${scrape_financial_data.concall_transcripts}
transcript_index = inputs.get('transcript_index', 0)

if transcript_index >= len(concall_transcripts):
    print(f"Transcript index {transcript_index} out of range")
    outputs["status"] = "skipped"
    return

current_transcript = concall_transcripts[transcript_index]
company_code = "${company_code}"
company_folder_id = loop_state.get('company_folder_id')
access_token = loop_state.get('access_token')

print(f"Processing transcript {transcript_index + 1}/{len(concall_transcripts)}: {current_transcript['date']}")

try:
    # Setup Drive API
    base_url = "https://www.googleapis.com/drive/v3"
    
    # Get/Create FY and Concall Reports folders (similar to annual reports)
    fy = current_transcript['financial_year']
    
    # Create FY folder (same logic as annual reports)
    fy_folder_metadata = {
        'name': fy,
        'mimeType': 'application/vnd.google-apps.folder',
        'parents': [company_folder_id]
    }
    
    query = f"name='{fy}' and mimeType='application/vnd.google-apps.folder' and '{company_folder_id}' in parents and trashed=false"
    params = {'q': query, 'fields': 'files(id, name)'}
    
    response = requests.get(f"{base_url}/files", headers={'Authorization': f'Bearer {access_token}'}, params=params)
    response.raise_for_status()
    
    existing_folders = response.json().get('files', [])
    if existing_folders:
        fy_folder_id = existing_folders[0]['id']
    else:
        response = requests.post(f"{base_url}/files", headers={'Authorization': f'Bearer {access_token}', 'Content-Type': 'application/json'}, json=fy_folder_metadata)
        response.raise_for_status()
        fy_folder_id = response.json()['id']
    
    # Create/Get "Concall Reports" subfolder
    concall_folder_metadata = {
        'name': 'Concall Reports',
        'mimeType': 'application/vnd.google-apps.folder',
        'parents': [fy_folder_id]
    }
    
    query = f"name='Concall Reports' and mimeType='application/vnd.google-apps.folder' and '{fy_folder_id}' in parents and trashed=false"
    params = {'q': query, 'fields': 'files(id, name)'}
    
    response = requests.get(f"{base_url}/files", headers={'Authorization': f'Bearer {access_token}'}, params=params)
    response.raise_for_status()
    
    existing_concall_folders = response.json().get('files', [])
    if existing_concall_folders:
        concall_folder_id = existing_concall_folders[0]['id']
    else:
        response = requests.post(f"{base_url}/files", headers={'Authorization': f'Bearer {access_token}', 'Content-Type': 'application/json'}, json=concall_folder_metadata)
        response.raise_for_status()
        concall_folder_id = response.json()['id']
    
    # Download and upload transcript
    date_clean = current_transcript['date'].replace(' ', '_').replace(':', '_')
    filename = f"{company_code}_{fy}_{date_clean}_Concall_Transcript.pdf"
    
    # Check if file already exists
    query = f"name='{filename}' and '{concall_folder_id}' in parents and trashed=false"
    params = {'q': query, 'fields': 'files(id, name)'}
    
    response = requests.get(f"{base_url}/files", headers={'Authorization': f'Bearer {access_token}'}, params=params)
    response.raise_for_status()
    
    existing_files = response.json().get('files', [])
    if existing_files:
        print(f"✓ {filename} already exists in Drive, skipping...")
        upload_result = existing_files[0]
        status = "already_exists"
    else:
        print(f"Downloading {filename}...")
        
        # Download file
        download_headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
        
        download_response = requests.get(current_transcript['url'], headers=download_headers, timeout=120, stream=True)
        download_response.raise_for_status()
        
        file_content = b''
        for chunk in download_response.iter_content(chunk_size=8192):
            file_content += chunk
        
        file_size = len(file_content)
        print(f"Downloaded {filename} ({file_size:,} bytes)")
        
        # Upload to Drive
        upload_metadata = {
            'name': filename,
            'parents': [concall_folder_id]
        }
        
        files = {
            'data': ('metadata', json.dumps(upload_metadata), 'application/json; charset=UTF-8'),
            'file': (filename, file_content, 'application/pdf')
        }
        
        upload_response = requests.post(
            'https://www.googleapis.com/upload/drive/v3/files?uploadType=multipart',
            headers={'Authorization': f'Bearer {access_token}'},
            files=files
        )
        upload_response.raise_for_status()
        
        upload_result = upload_response.json()
        print(f"✓ Uploaded {filename} to Drive (ID: {upload_result['id']})")
        status = "uploaded"
    
    # Update state
    processed_transcript = {
        "filename": filename,
        "financial_year": fy,
        "date": current_transcript['date'],
        "drive_file_id": upload_result['id'],
        "status": status,
        "processed_at": datetime.now().isoformat(),
        "source_url": current_transcript['url']
    }
    
    current_processed = loop_state.get('processed_transcripts', [])
    current_processed.append(processed_transcript)
    
    state_updates["processed_transcripts"] = current_processed
    state_updates["processed_count"] = loop_state.get('processed_count', 0) + 1
    
    if status == "uploaded":
        state_updates["uploaded_count"] = loop_state.get('uploaded_count', 0) + 1
    
    outputs["processed_transcript"] = processed_transcript
    outputs["status"] = "success"
    
    print(f"✓ Successfully processed {filename}")
    
except Exception as e:
    print(f"✗ Failed to process transcript {transcript_index + 1}: {str(e)}")
    
    state_updates["error_count"] = loop_state.get('error_count', 0) + 1
    state_updates["processed_count"] = loop_state.get('processed_count', 0) + 1
    
    outputs["error"] = str(e)
    outputs["status"] = "error"

# Small delay between processing
time.sleep(2)

generate_extraction_log

script

Generate comprehensive extraction log and summary

Python Script:

import json
from datetime import datetime

company_code = "${company_code}"
company_folder_id = "${initialize_drive_handler.company_folder_id}"

# Get processing results
try:
    annual_reports_state = ${process_annual_reports.final_state}
    processed_reports = annual_reports_state.get('processed_reports', [])
    reports_uploaded = annual_reports_state.get('uploaded_count', 0)
    reports_processed = annual_reports_state.get('processed_count', 0)
    reports_errors = annual_reports_state.get('error_count', 0)
except:
    processed_reports = []
    reports_uploaded = 0
    reports_processed = 0
    reports_errors = 0

try:
    if ${include_concalls}:
        concall_transcripts_state = ${process_concall_transcripts.final_state}
        processed_transcripts = concall_transcripts_state.get('processed_transcripts', [])
        transcripts_uploaded = concall_transcripts_state.get('uploaded_count', 0)
        transcripts_processed = concall_transcripts_state.get('processed_count', 0)
        transcripts_errors = concall_transcripts_state.get('error_count', 0)
    else:
        processed_transcripts = []
        transcripts_uploaded = 0
        transcripts_processed = 0
        transcripts_errors = 0
except:
    processed_transcripts = []
    transcripts_uploaded = 0
    transcripts_processed = 0
    transcripts_errors = 0

# Group by financial year
financial_years = {}

for report in processed_reports:
    fy = report['financial_year']
    if fy not in financial_years:
        financial_years[fy] = {
            "annual_reports": [],
            "concall_reports": [],
            "last_updated": None,
            "drive_folder_id": None
        }
    financial_years[fy]["annual_reports"].append(report)
    financial_years[fy]["last_updated"] = report['processed_at']

for transcript in processed_transcripts:
    fy = transcript['financial_year']
    if fy not in financial_years:
        financial_years[fy] = {
            "annual_reports": [],
            "concall_reports": [],
            "last_updated": None,
            "drive_folder_id": None
        }
    financial_years[fy]["concall_reports"].append(transcript)
    if not financial_years[fy]["last_updated"] or transcript['processed_at'] > financial_years[fy]["last_updated"]:
        financial_years[fy]["last_updated"] = transcript['processed_at']

# Create extraction log
extraction_log = {
    company_code: {
        "company_code": company_code,
        "last_updated": datetime.now().isoformat(),
        "financial_years": financial_years,
        "total_reports": len(processed_reports),
        "total_concalls": len(processed_transcripts),
        "drive_folder_id": company_folder_id,
        "processing_summary": {
            "reports_processed": reports_processed,
            "reports_uploaded": reports_uploaded,
            "reports_errors": reports_errors,
            "transcripts_processed": transcripts_processed,
            "transcripts_uploaded": transcripts_uploaded,
            "transcripts_errors": transcripts_errors
        }
    }
}

result = {
    "extraction_log": extraction_log,
    "summary": {
        "company_code": company_code,
        "drive_company_folder_id": company_folder_id,
        "financial_years_processed": list(financial_years.keys()),
        "total_reports_processed": reports_processed,
        "total_reports_uploaded": reports_uploaded,
        "total_transcripts_processed": transcripts_processed,
        "total_transcripts_uploaded": transcripts_uploaded,
        "total_errors": reports_errors + transcripts_errors
    }
}

print("=" * 60)
print(f"EXTRACTION SUMMARY FOR {company_code}")
print("=" * 60)
print(f"Company Drive Folder ID: {company_folder_id}")
print(f"Financial Years Processed: {', '.join(sorted(financial_years.keys()))}")
print(f"Annual Reports - Processed: {reports_processed}, Uploaded: {reports_uploaded}, Errors: {reports_errors}")
print(f"Concall Transcripts - Processed: {transcripts_processed}, Uploaded: {transcripts_uploaded}, Errors: {transcripts_errors}")
print("=" * 60)

print(f"__OUTPUTS__ {json.dumps(result)}")

YAML Source

name: Financial Data Extraction Workflow
retry:
  retryOn:
  - TEMPORARY_FAILURE
  - NETWORK_ERROR
  maxDelay: 30s
  maxAttempts: 3
  initialDelay: 5s
  backoffMultiplier: 2.0
tasks:
- id: initialize_drive_handler
  name: Initialize Google Drive Handler
  type: script
  script: "import json\nimport requests\nimport logging\n\nlogging.basicConfig(level=logging.INFO)\n\
    logger = logging.getLogger(__name__)\n\nclass GoogleDriveHandler:\n    def __init__(self,\
    \ nango_connection_id, nango_key):\n        self.access_token = self.get_nango_access_token(nango_connection_id,\
    \ nango_key)\n        self.base_url = \"https://www.googleapis.com/drive/v3\"\n\
    \        self.headers = {\n            'Authorization': f'Bearer {self.access_token}',\n\
    \            'Content-Type': 'application/json'\n        }\n    \n    def get_nango_access_token(self,\
    \ connection_id, nango_key):\n        url = f\"https://auth-dev.assistents.ai/connection/{connection_id}?provider_config_key=google-drive-hq3h\"\
    \n        headers = {\n            'Authorization': f'Bearer {nango_key}',\n \
    \           'Content-Type': 'application/json'\n        }\n        \n        response\
    \ = requests.get(url, headers=headers)\n        response.raise_for_status()\n\
    \        \n        data = response.json()\n        access_token = data['credentials']['access_token']\n\
    \        logger.info(\"Successfully retrieved access token from Nango\")\n   \
    \     return access_token\n    \n    def create_folder(self, folder_name, parent_folder_id):\n\
    \        metadata = {\n            'name': folder_name,\n            'mimeType':\
    \ 'application/vnd.google-apps.folder',\n            'parents': [parent_folder_id]\n\
    \        }\n        \n        response = requests.post(\n            f\"{self.base_url}/files\"\
    ,\n            headers=self.headers,\n            json=metadata\n        )\n \
    \       response.raise_for_status()\n        \n        folder_data = response.json()\n\
    \        logger.info(f\"Created folder: {folder_name} (ID: {folder_data['id']})\"\
    )\n        return folder_data\n    \n    def check_folder_exists(self, folder_name,\
    \ parent_folder_id):\n        query = f\"name='{folder_name}' and mimeType='application/vnd.google-apps.folder'\
    \ and '{parent_folder_id}' in parents and trashed=false\"\n        params = {\n\
    \            'q': query,\n            'fields': 'files(id, name)',\n         \
    \   'pageSize': 100\n        }\n        \n        response = requests.get(f\"\
    {self.base_url}/files\", headers=self.headers, params=params)\n        response.raise_for_status()\n\
    \        \n        files = response.json().get('files', [])\n        return files[0]\
    \ if files else None\n    \n    def get_or_create_folder(self, folder_name, parent_folder_id):\n\
    \        existing_folder = self.check_folder_exists(folder_name, parent_folder_id)\n\
    \        if existing_folder:\n            logger.info(f\"Using existing folder:\
    \ {folder_name} (ID: {existing_folder['id']})\")\n            return existing_folder\n\
    \        else:\n            return self.create_folder(folder_name, parent_folder_id)\n\
    \n# Initialize drive handler\nnango_connection_id = \"${nango_connection_id}\"\
    \nnango_key = \"${nango_key}\"\nmain_folder_id = \"${main_folder_id}\"\ncompany_code\
    \ = \"${company_code}\"\n\ntry:\n    drive_handler = GoogleDriveHandler(nango_connection_id,\
    \ nango_key)\n    \n    # Create/get company folder\n    company_folder = drive_handler.get_or_create_folder(company_code,\
    \ main_folder_id)\n    \n    result = {\n        \"drive_handler_initialized\"\
    : True,\n        \"company_folder_id\": company_folder['id'],\n        \"company_folder_name\"\
    : company_folder['name'],\n        \"access_token\": drive_handler.access_token[:20]\
    \ + \"...\" # Truncate for security\n    }\n    \n    print(f\"\u2713 Google Drive\
    \ handler initialized successfully\")\n    print(f\"\u2713 Company folder ready:\
    \ {company_code} (ID: {company_folder['id']})\")\n    print(f\"__OUTPUTS__ {json.dumps(result)}\"\
    )\n    \nexcept Exception as e:\n    error_result = {\n        \"drive_handler_initialized\"\
    : False,\n        \"error\": str(e)\n    }\n    print(f\"\u2717 Failed to initialize\
    \ Google Drive handler: {str(e)}\")\n    print(f\"__OUTPUTS__ {json.dumps(error_result)}\"\
    )\n    raise\n"
  description: Initialize Google Drive handler and create company folder
  timeout_seconds: 120
- id: scrape_financial_data
  name: Scrape Financial Data from Screener.in
  type: script
  script: "import json\nimport asyncio\nimport re\nfrom playwright.async_api import\
    \ async_playwright\n\ndef extract_financial_year(year_text):\n    clean_text =\
    \ year_text.replace(\"Financial Year \", \"\").strip()\n    \n    # Look for patterns\
    \ like \"2023-24\" or \"2023-2024\"\n    match = re.search(r'(\\d{4})-(\\d{2,4})',\
    \ clean_text)\n    if match:\n        end_year = match.group(2)\n        if len(end_year)\
    \ == 2:\n            return f\"FY{end_year}\"\n        else:\n            return\
    \ f\"FY{end_year[-2:]}\"\n    \n    # Look for single year like \"2024\"\n   \
    \ match = re.search(r'(\\d{4})', clean_text)\n    if match:\n        year = match.group(1)\n\
    \        return f\"FY{year[-2:]}\"\n    \n    return f\"FY{clean_text}\"\n\nasync\
    \ def scrape_data():\n    company_code = \"${company_code}\"\n    target_years\
    \ = ${financial_years}\n    include_concalls = ${include_concalls}\n    \n   \
    \ url = f\"https://www.screener.in/company/{company_code}/consolidated/\"\n  \
    \  \n    async with async_playwright() as p:\n        browser = await p.chromium.launch(headless=True)\n\
    \        context = await browser.new_context(\n            user_agent='Mozilla/5.0\
    \ (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'\n        )\n        page =\
    \ await context.new_page()\n        \n        try:\n            print(f\"Navigating\
    \ to {url}\")\n            await page.goto(url, wait_until='networkidle')\n  \
    \          await page.wait_for_selector('.documents.annual-reports', timeout=10000)\n\
    \            \n            # Extract annual reports\n            annual_reports\
    \ = await page.evaluate('''\n                () => {\n                    const\
    \ reportSection = document.querySelector('.documents.annual-reports');\n     \
    \               if (!reportSection) return [];\n                    \n       \
    \             const links = reportSection.querySelectorAll('ul.list-links li a');\n\
    \                    return Array.from(links).map(link => ({\n               \
    \         year: link.textContent.trim().split('\\\\n')[0],\n                 \
    \       url: link.href,\n                        source: link.querySelector('.ink-600.smaller')?.textContent.trim()\
    \ || ''\n                    }));\n                }\n            ''')\n     \
    \       \n            # Extract concall transcripts if requested\n           \
    \ concall_transcripts = []\n            if include_concalls:\n               \
    \ concall_transcripts = await page.evaluate('''\n                    () => {\n\
    \                        const concallSection = document.querySelector('.documents.concalls');\n\
    \                        if (!concallSection) return [];\n                   \
    \     \n                        const items = concallSection.querySelectorAll('ul.list-links\
    \ li');\n                        const transcripts = [];\n                   \
    \     \n                        items.forEach(item => {\n                    \
    \        const dateDiv = item.querySelector('.ink-600.font-size-15.font-weight-500.nowrap');\n\
    \                            const transcriptLink = item.querySelector('a.concall-link[title=\"\
    Raw Transcript\"]');\n                            \n                         \
    \   if (dateDiv && transcriptLink) {\n                                transcripts.push({\n\
    \                                    date: dateDiv.textContent.trim(),\n     \
    \                               url: transcriptLink.href,\n                  \
    \                  type: 'transcript'\n                                });\n \
    \                           }\n                        });\n                 \
    \       \n                        return transcripts;\n                    }\n\
    \                ''')\n            \n            return {\n                \"\
    annual_reports\": annual_reports,\n                \"concall_transcripts\": concall_transcripts\n\
    \            }\n            \n        finally:\n            await browser.close()\n\
    \n# Run scraping\ntry:\n    scraped_data = asyncio.run(scrape_data())\n    \n\
    \    # Filter by target years if specified\n    target_years = ${financial_years}\n\
    \    if target_years != [\"all\"]:\n        filtered_reports = []\n        for\
    \ report in scraped_data[\"annual_reports\"]:\n            fy = extract_financial_year(report['year'])\n\
    \            if fy in target_years:\n                report['financial_year']\
    \ = fy\n                filtered_reports.append(report)\n        scraped_data[\"\
    annual_reports\"] = filtered_reports\n        \n        filtered_concalls = []\n\
    \        for transcript in scraped_data[\"concall_transcripts\"]:\n          \
    \  # Extract FY from concall date\n            date_text = transcript['date']\n\
    \            fy_match = re.search(r'FY(\\d{2})', date_text)\n            if fy_match:\n\
    \                fy = f\"FY{fy_match.group(1)}\"\n            else:\n        \
    \        year_match = re.search(r'(\\d{4})', date_text)\n                if year_match:\n\
    \                    year = int(year_match.group(1))\n                    fy =\
    \ f\"FY{str(year)[-2:]}\"\n                else:\n                    fy = \"\
    FY_Unknown\"\n            \n            if fy in target_years:\n             \
    \   transcript['financial_year'] = fy\n                filtered_concalls.append(transcript)\n\
    \        scraped_data[\"concall_transcripts\"] = filtered_concalls\n    else:\n\
    \        # Add financial year to all items\n        for report in scraped_data[\"\
    annual_reports\"]:\n            report['financial_year'] = extract_financial_year(report['year'])\n\
    \        \n        for transcript in scraped_data[\"concall_transcripts\"]:\n\
    \            date_text = transcript['date']\n            fy_match = re.search(r'FY(\\\
    d{2})', date_text)\n            if fy_match:\n                fy = f\"FY{fy_match.group(1)}\"\
    \n            else:\n                year_match = re.search(r'(\\d{4})', date_text)\n\
    \                if year_match:\n                    year = int(year_match.group(1))\n\
    \                    fy = f\"FY{str(year)[-2:]}\"\n                else:\n   \
    \                 fy = \"FY_Unknown\"\n            transcript['financial_year']\
    \ = fy\n    \n    result = {\n        \"scraping_successful\": True,\n       \
    \ \"company_code\": \"${company_code}\",\n        \"annual_reports_found\": len(scraped_data[\"\
    annual_reports\"]),\n        \"concall_transcripts_found\": len(scraped_data[\"\
    concall_transcripts\"]),\n        \"annual_reports\": scraped_data[\"annual_reports\"\
    ],\n        \"concall_transcripts\": scraped_data[\"concall_transcripts\"]\n \
    \   }\n    \n    print(f\"\u2713 Scraping completed for ${company_code}\")\n \
    \   print(f\"\u2713 Found {len(scraped_data['annual_reports'])} annual reports\"\
    )\n    print(f\"\u2713 Found {len(scraped_data['concall_transcripts'])} concall\
    \ transcripts\")\n    print(f\"__OUTPUTS__ {json.dumps(result)}\")\n    \nexcept\
    \ Exception as e:\n    error_result = {\n        \"scraping_successful\": False,\n\
    \        \"error\": str(e)\n    }\n    print(f\"\u2717 Scraping failed: {str(e)}\"\
    )\n    print(f\"__OUTPUTS__ {json.dumps(error_result)}\")\n    raise\n"
  depends_on:
  - initialize_drive_handler
  description: Scrape annual reports and concall transcripts from screener.in
  dependencies:
  - playwright
  timeout_seconds: 300
- id: process_annual_reports
  name: Process Annual Reports
  type: loop
  loop_type: for
  depends_on:
  - scrape_financial_data
  - initialize_drive_handler
  loop_tasks:
  - id: download_and_upload_report
    name: Download and Upload Annual Report
    type: script
    script: "import json\nimport requests\nimport re\nimport time\nfrom datetime import\
      \ datetime\n\n# Get current report\nannual_reports = ${scrape_financial_data.annual_reports}\n\
      report_index = inputs.get('report_index', 0)\n\nif report_index >= len(annual_reports):\n\
      \    print(f\"Report index {report_index} out of range\")\n    outputs[\"status\"\
      ] = \"skipped\"\n    return\n\ncurrent_report = annual_reports[report_index]\n\
      company_code = \"${company_code}\"\ncompany_folder_id = loop_state.get('company_folder_id')\n\
      access_token = loop_state.get('access_token')\n\nprint(f\"Processing report\
      \ {report_index + 1}/{len(annual_reports)}: {current_report['year']}\")\n\n\
      try:\n    # Setup Drive API headers\n    headers = {'Authorization': f'Bearer\
      \ {access_token}'}\n    base_url = \"https://www.googleapis.com/drive/v3\"\n\
      \    \n    # Get/Create FY folder\n    fy = current_report['financial_year']\n\
      \    \n    # Create FY folder\n    fy_folder_metadata = {\n        'name': fy,\n\
      \        'mimeType': 'application/vnd.google-apps.folder',\n        'parents':\
      \ [company_folder_id]\n    }\n    \n    # Check if FY folder exists\n    query\
      \ = f\"name='{fy}' and mimeType='application/vnd.google-apps.folder' and '{company_folder_id}'\
      \ in parents and trashed=false\"\n    params = {'q': query, 'fields': 'files(id,\
      \ name)'}\n    \n    response = requests.get(f\"{base_url}/files\", headers={'Authorization':\
      \ f'Bearer {access_token}'}, params=params)\n    response.raise_for_status()\n\
      \    \n    existing_folders = response.json().get('files', [])\n    if existing_folders:\n\
      \        fy_folder_id = existing_folders[0]['id']\n        print(f\"Using existing\
      \ FY folder: {fy}\")\n    else:\n        response = requests.post(f\"{base_url}/files\"\
      , headers={'Authorization': f'Bearer {access_token}', 'Content-Type': 'application/json'},\
      \ json=fy_folder_metadata)\n        response.raise_for_status()\n        fy_folder_id\
      \ = response.json()['id']\n        print(f\"Created FY folder: {fy}\")\n   \
      \ \n    # Create/Get \"Yearly report\" subfolder\n    yearly_report_metadata\
      \ = {\n        'name': 'Yearly report',\n        'mimeType': 'application/vnd.google-apps.folder',\n\
      \        'parents': [fy_folder_id]\n    }\n    \n    query = f\"name='Yearly\
      \ report' and mimeType='application/vnd.google-apps.folder' and '{fy_folder_id}'\
      \ in parents and trashed=false\"\n    params = {'q': query, 'fields': 'files(id,\
      \ name)'}\n    \n    response = requests.get(f\"{base_url}/files\", headers={'Authorization':\
      \ f'Bearer {access_token}'}, params=params)\n    response.raise_for_status()\n\
      \    \n    existing_yearly_folders = response.json().get('files', [])\n    if\
      \ existing_yearly_folders:\n        yearly_folder_id = existing_yearly_folders[0]['id']\n\
      \    else:\n        response = requests.post(f\"{base_url}/files\", headers={'Authorization':\
      \ f'Bearer {access_token}', 'Content-Type': 'application/json'}, json=yearly_report_metadata)\n\
      \        response.raise_for_status()\n        yearly_folder_id = response.json()['id']\n\
      \    \n    # Download and upload file\n    filename = f\"{company_code}_{fy}_Annual_Report.pdf\"\
      \n    \n    # Check if file already exists\n    query = f\"name='{filename}'\
      \ and '{yearly_folder_id}' in parents and trashed=false\"\n    params = {'q':\
      \ query, 'fields': 'files(id, name)'}\n    \n    response = requests.get(f\"\
      {base_url}/files\", headers={'Authorization': f'Bearer {access_token}'}, params=params)\n\
      \    response.raise_for_status()\n    \n    existing_files = response.json().get('files',\
      \ [])\n    if existing_files:\n        print(f\"\u2713 {filename} already exists\
      \ in Drive, skipping...\")\n        upload_result = existing_files[0]\n    \
      \    status = \"already_exists\"\n    else:\n        print(f\"Downloading {filename}...\"\
      )\n        \n        # Download file\n        download_headers = {\n       \
      \     'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'\n\
      \        }\n        \n        download_response = requests.get(current_report['url'],\
      \ headers=download_headers, timeout=120, stream=True)\n        download_response.raise_for_status()\n\
      \        \n        file_content = b''\n        for chunk in download_response.iter_content(chunk_size=8192):\n\
      \            file_content += chunk\n        \n        file_size = len(file_content)\n\
      \        print(f\"Downloaded {filename} ({file_size:,} bytes)\")\n        \n\
      \        # Upload to Drive\n        upload_metadata = {\n            'name':\
      \ filename,\n            'parents': [yearly_folder_id]\n        }\n        \n\
      \        files = {\n            'data': ('metadata', json.dumps(upload_metadata),\
      \ 'application/json; charset=UTF-8'),\n            'file': (filename, file_content,\
      \ 'application/pdf')\n        }\n        \n        upload_response = requests.post(\n\
      \            'https://www.googleapis.com/upload/drive/v3/files?uploadType=multipart',\n\
      \            headers={'Authorization': f'Bearer {access_token}'},\n        \
      \    files=files\n        )\n        upload_response.raise_for_status()\n  \
      \      \n        upload_result = upload_response.json()\n        print(f\"\u2713\
      \ Uploaded {filename} to Drive (ID: {upload_result['id']})\")\n        status\
      \ = \"uploaded\"\n    \n    # Update state\n    processed_report = {\n     \
      \   \"filename\": filename,\n        \"financial_year\": fy,\n        \"drive_file_id\"\
      : upload_result['id'],\n        \"status\": status,\n        \"processed_at\"\
      : datetime.now().isoformat(),\n        \"source_url\": current_report['url']\n\
      \    }\n    \n    current_processed = loop_state.get('processed_reports', [])\n\
      \    current_processed.append(processed_report)\n    \n    state_updates[\"\
      processed_reports\"] = current_processed\n    state_updates[\"processed_count\"\
      ] = loop_state.get('processed_count', 0) + 1\n    \n    if status == \"uploaded\"\
      :\n        state_updates[\"uploaded_count\"] = loop_state.get('uploaded_count',\
      \ 0) + 1\n    \n    outputs[\"processed_report\"] = processed_report\n    outputs[\"\
      status\"] = \"success\"\n    \n    print(f\"\u2713 Successfully processed {filename}\"\
      )\n    \nexcept Exception as e:\n    print(f\"\u2717 Failed to process report\
      \ {report_index + 1}: {str(e)}\")\n    \n    state_updates[\"error_count\"]\
      \ = loop_state.get('error_count', 0) + 1\n    state_updates[\"processed_count\"\
      ] = loop_state.get('processed_count', 0) + 1\n    \n    outputs[\"error\"] =\
      \ str(e)\n    outputs[\"status\"] = \"error\"\n\n# Small delay between processing\n\
      time.sleep(2)\n"
    description: Download and upload individual annual report
    timeout_seconds: 180
  description: Process all annual reports in a loop
  max_iterations: ${scrape_financial_data.annual_reports_found}
  state_variables:
    error_count: 0
    access_token: ${initialize_drive_handler.access_token}
    uploaded_count: 0
    processed_count: 0
    company_folder_id: ${initialize_drive_handler.company_folder_id}
    processed_reports: []
  iteration_variable: report_index
- id: process_concall_transcripts
  name: Process Concall Transcripts
  type: loop
  condition: ${include_concalls} == true
  loop_type: for
  depends_on:
  - scrape_financial_data
  - initialize_drive_handler
  loop_tasks:
  - id: download_and_upload_transcript
    name: Download and Upload Concall Transcript
    type: script
    script: "import json\nimport requests\nimport re\nimport time\nfrom datetime import\
      \ datetime\n\n# Get current transcript\nconcall_transcripts = ${scrape_financial_data.concall_transcripts}\n\
      transcript_index = inputs.get('transcript_index', 0)\n\nif transcript_index\
      \ >= len(concall_transcripts):\n    print(f\"Transcript index {transcript_index}\
      \ out of range\")\n    outputs[\"status\"] = \"skipped\"\n    return\n\ncurrent_transcript\
      \ = concall_transcripts[transcript_index]\ncompany_code = \"${company_code}\"\
      \ncompany_folder_id = loop_state.get('company_folder_id')\naccess_token = loop_state.get('access_token')\n\
      \nprint(f\"Processing transcript {transcript_index + 1}/{len(concall_transcripts)}:\
      \ {current_transcript['date']}\")\n\ntry:\n    # Setup Drive API\n    base_url\
      \ = \"https://www.googleapis.com/drive/v3\"\n    \n    # Get/Create FY and Concall\
      \ Reports folders (similar to annual reports)\n    fy = current_transcript['financial_year']\n\
      \    \n    # Create FY folder (same logic as annual reports)\n    fy_folder_metadata\
      \ = {\n        'name': fy,\n        'mimeType': 'application/vnd.google-apps.folder',\n\
      \        'parents': [company_folder_id]\n    }\n    \n    query = f\"name='{fy}'\
      \ and mimeType='application/vnd.google-apps.folder' and '{company_folder_id}'\
      \ in parents and trashed=false\"\n    params = {'q': query, 'fields': 'files(id,\
      \ name)'}\n    \n    response = requests.get(f\"{base_url}/files\", headers={'Authorization':\
      \ f'Bearer {access_token}'}, params=params)\n    response.raise_for_status()\n\
      \    \n    existing_folders = response.json().get('files', [])\n    if existing_folders:\n\
      \        fy_folder_id = existing_folders[0]['id']\n    else:\n        response\
      \ = requests.post(f\"{base_url}/files\", headers={'Authorization': f'Bearer\
      \ {access_token}', 'Content-Type': 'application/json'}, json=fy_folder_metadata)\n\
      \        response.raise_for_status()\n        fy_folder_id = response.json()['id']\n\
      \    \n    # Create/Get \"Concall Reports\" subfolder\n    concall_folder_metadata\
      \ = {\n        'name': 'Concall Reports',\n        'mimeType': 'application/vnd.google-apps.folder',\n\
      \        'parents': [fy_folder_id]\n    }\n    \n    query = f\"name='Concall\
      \ Reports' and mimeType='application/vnd.google-apps.folder' and '{fy_folder_id}'\
      \ in parents and trashed=false\"\n    params = {'q': query, 'fields': 'files(id,\
      \ name)'}\n    \n    response = requests.get(f\"{base_url}/files\", headers={'Authorization':\
      \ f'Bearer {access_token}'}, params=params)\n    response.raise_for_status()\n\
      \    \n    existing_concall_folders = response.json().get('files', [])\n   \
      \ if existing_concall_folders:\n        concall_folder_id = existing_concall_folders[0]['id']\n\
      \    else:\n        response = requests.post(f\"{base_url}/files\", headers={'Authorization':\
      \ f'Bearer {access_token}', 'Content-Type': 'application/json'}, json=concall_folder_metadata)\n\
      \        response.raise_for_status()\n        concall_folder_id = response.json()['id']\n\
      \    \n    # Download and upload transcript\n    date_clean = current_transcript['date'].replace('\
      \ ', '_').replace(':', '_')\n    filename = f\"{company_code}_{fy}_{date_clean}_Concall_Transcript.pdf\"\
      \n    \n    # Check if file already exists\n    query = f\"name='{filename}'\
      \ and '{concall_folder_id}' in parents and trashed=false\"\n    params = {'q':\
      \ query, 'fields': 'files(id, name)'}\n    \n    response = requests.get(f\"\
      {base_url}/files\", headers={'Authorization': f'Bearer {access_token}'}, params=params)\n\
      \    response.raise_for_status()\n    \n    existing_files = response.json().get('files',\
      \ [])\n    if existing_files:\n        print(f\"\u2713 {filename} already exists\
      \ in Drive, skipping...\")\n        upload_result = existing_files[0]\n    \
      \    status = \"already_exists\"\n    else:\n        print(f\"Downloading {filename}...\"\
      )\n        \n        # Download file\n        download_headers = {\n       \
      \     'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'\n\
      \        }\n        \n        download_response = requests.get(current_transcript['url'],\
      \ headers=download_headers, timeout=120, stream=True)\n        download_response.raise_for_status()\n\
      \        \n        file_content = b''\n        for chunk in download_response.iter_content(chunk_size=8192):\n\
      \            file_content += chunk\n        \n        file_size = len(file_content)\n\
      \        print(f\"Downloaded {filename} ({file_size:,} bytes)\")\n        \n\
      \        # Upload to Drive\n        upload_metadata = {\n            'name':\
      \ filename,\n            'parents': [concall_folder_id]\n        }\n       \
      \ \n        files = {\n            'data': ('metadata', json.dumps(upload_metadata),\
      \ 'application/json; charset=UTF-8'),\n            'file': (filename, file_content,\
      \ 'application/pdf')\n        }\n        \n        upload_response = requests.post(\n\
      \            'https://www.googleapis.com/upload/drive/v3/files?uploadType=multipart',\n\
      \            headers={'Authorization': f'Bearer {access_token}'},\n        \
      \    files=files\n        )\n        upload_response.raise_for_status()\n  \
      \      \n        upload_result = upload_response.json()\n        print(f\"\u2713\
      \ Uploaded {filename} to Drive (ID: {upload_result['id']})\")\n        status\
      \ = \"uploaded\"\n    \n    # Update state\n    processed_transcript = {\n \
      \       \"filename\": filename,\n        \"financial_year\": fy,\n        \"\
      date\": current_transcript['date'],\n        \"drive_file_id\": upload_result['id'],\n\
      \        \"status\": status,\n        \"processed_at\": datetime.now().isoformat(),\n\
      \        \"source_url\": current_transcript['url']\n    }\n    \n    current_processed\
      \ = loop_state.get('processed_transcripts', [])\n    current_processed.append(processed_transcript)\n\
      \    \n    state_updates[\"processed_transcripts\"] = current_processed\n  \
      \  state_updates[\"processed_count\"] = loop_state.get('processed_count', 0)\
      \ + 1\n    \n    if status == \"uploaded\":\n        state_updates[\"uploaded_count\"\
      ] = loop_state.get('uploaded_count', 0) + 1\n    \n    outputs[\"processed_transcript\"\
      ] = processed_transcript\n    outputs[\"status\"] = \"success\"\n    \n    print(f\"\
      \u2713 Successfully processed {filename}\")\n    \nexcept Exception as e:\n\
      \    print(f\"\u2717 Failed to process transcript {transcript_index + 1}: {str(e)}\"\
      )\n    \n    state_updates[\"error_count\"] = loop_state.get('error_count',\
      \ 0) + 1\n    state_updates[\"processed_count\"] = loop_state.get('processed_count',\
      \ 0) + 1\n    \n    outputs[\"error\"] = str(e)\n    outputs[\"status\"] = \"\
      error\"\n\n# Small delay between processing\ntime.sleep(2)\n"
    description: Download and upload individual concall transcript
    timeout_seconds: 180
  description: Process all concall transcripts in a loop
  max_iterations: ${scrape_financial_data.concall_transcripts_found}
  state_variables:
    error_count: 0
    access_token: ${initialize_drive_handler.access_token}
    uploaded_count: 0
    processed_count: 0
    company_folder_id: ${initialize_drive_handler.company_folder_id}
    processed_transcripts: []
  iteration_variable: transcript_index
- id: generate_extraction_log
  name: Generate Extraction Log
  type: script
  script: "import json\nfrom datetime import datetime\n\ncompany_code = \"${company_code}\"\
    \ncompany_folder_id = \"${initialize_drive_handler.company_folder_id}\"\n\n# Get\
    \ processing results\ntry:\n    annual_reports_state = ${process_annual_reports.final_state}\n\
    \    processed_reports = annual_reports_state.get('processed_reports', [])\n \
    \   reports_uploaded = annual_reports_state.get('uploaded_count', 0)\n    reports_processed\
    \ = annual_reports_state.get('processed_count', 0)\n    reports_errors = annual_reports_state.get('error_count',\
    \ 0)\nexcept:\n    processed_reports = []\n    reports_uploaded = 0\n    reports_processed\
    \ = 0\n    reports_errors = 0\n\ntry:\n    if ${include_concalls}:\n        concall_transcripts_state\
    \ = ${process_concall_transcripts.final_state}\n        processed_transcripts\
    \ = concall_transcripts_state.get('processed_transcripts', [])\n        transcripts_uploaded\
    \ = concall_transcripts_state.get('uploaded_count', 0)\n        transcripts_processed\
    \ = concall_transcripts_state.get('processed_count', 0)\n        transcripts_errors\
    \ = concall_transcripts_state.get('error_count', 0)\n    else:\n        processed_transcripts\
    \ = []\n        transcripts_uploaded = 0\n        transcripts_processed = 0\n\
    \        transcripts_errors = 0\nexcept:\n    processed_transcripts = []\n   \
    \ transcripts_uploaded = 0\n    transcripts_processed = 0\n    transcripts_errors\
    \ = 0\n\n# Group by financial year\nfinancial_years = {}\n\nfor report in processed_reports:\n\
    \    fy = report['financial_year']\n    if fy not in financial_years:\n      \
    \  financial_years[fy] = {\n            \"annual_reports\": [],\n            \"\
    concall_reports\": [],\n            \"last_updated\": None,\n            \"drive_folder_id\"\
    : None\n        }\n    financial_years[fy][\"annual_reports\"].append(report)\n\
    \    financial_years[fy][\"last_updated\"] = report['processed_at']\n\nfor transcript\
    \ in processed_transcripts:\n    fy = transcript['financial_year']\n    if fy\
    \ not in financial_years:\n        financial_years[fy] = {\n            \"annual_reports\"\
    : [],\n            \"concall_reports\": [],\n            \"last_updated\": None,\n\
    \            \"drive_folder_id\": None\n        }\n    financial_years[fy][\"\
    concall_reports\"].append(transcript)\n    if not financial_years[fy][\"last_updated\"\
    ] or transcript['processed_at'] > financial_years[fy][\"last_updated\"]:\n   \
    \     financial_years[fy][\"last_updated\"] = transcript['processed_at']\n\n#\
    \ Create extraction log\nextraction_log = {\n    company_code: {\n        \"company_code\"\
    : company_code,\n        \"last_updated\": datetime.now().isoformat(),\n     \
    \   \"financial_years\": financial_years,\n        \"total_reports\": len(processed_reports),\n\
    \        \"total_concalls\": len(processed_transcripts),\n        \"drive_folder_id\"\
    : company_folder_id,\n        \"processing_summary\": {\n            \"reports_processed\"\
    : reports_processed,\n            \"reports_uploaded\": reports_uploaded,\n  \
    \          \"reports_errors\": reports_errors,\n            \"transcripts_processed\"\
    : transcripts_processed,\n            \"transcripts_uploaded\": transcripts_uploaded,\n\
    \            \"transcripts_errors\": transcripts_errors\n        }\n    }\n}\n\
    \nresult = {\n    \"extraction_log\": extraction_log,\n    \"summary\": {\n  \
    \      \"company_code\": company_code,\n        \"drive_company_folder_id\": company_folder_id,\n\
    \        \"financial_years_processed\": list(financial_years.keys()),\n      \
    \  \"total_reports_processed\": reports_processed,\n        \"total_reports_uploaded\"\
    : reports_uploaded,\n        \"total_transcripts_processed\": transcripts_processed,\n\
    \        \"total_transcripts_uploaded\": transcripts_uploaded,\n        \"total_errors\"\
    : reports_errors + transcripts_errors\n    }\n}\n\nprint(\"=\" * 60)\nprint(f\"\
    EXTRACTION SUMMARY FOR {company_code}\")\nprint(\"=\" * 60)\nprint(f\"Company\
    \ Drive Folder ID: {company_folder_id}\")\nprint(f\"Financial Years Processed:\
    \ {', '.join(sorted(financial_years.keys()))}\")\nprint(f\"Annual Reports - Processed:\
    \ {reports_processed}, Uploaded: {reports_uploaded}, Errors: {reports_errors}\"\
    )\nprint(f\"Concall Transcripts - Processed: {transcripts_processed}, Uploaded:\
    \ {transcripts_uploaded}, Errors: {transcripts_errors}\")\nprint(\"=\" * 60)\n\
    \nprint(f\"__OUTPUTS__ {json.dumps(result)}\")\n"
  depends_on:
  - process_annual_reports
  - process_concall_transcripts
  description: Generate comprehensive extraction log and summary
inputs:
- name: company_code
  type: string
  default: ULTRACEMCO
  required: false
  validation:
    pattern: ^[A-Z0-9]+$
  description: Company code to scrape (e.g., BEL, DALBHARAT)
- name: financial_years
  type: array
  items:
    type: string
  default:
  - all
  required: false
  description: Financial years to extract (e.g., ['FY24', 'FY23']) or ['all'] for
    all years
- name: include_concalls
  type: boolean
  default: true
  required: false
  description: Whether to include concall transcripts
- name: nango_connection_id
  type: string
  default: e233fe88-9ee3-48b7-93a5-17a21091e79f
  required: false
  description: Nango connection ID for Google Drive access
- name: nango_key
  type: string
  default: 8df3e2de-2307-48d3-94bd-ddd3fd6a62ec
  required: false
  description: Nango API key for authentication
- name: main_folder_id
  type: string
  default: 1W22-59ESyR-E_1PMVWevzL-WvlFALDl-
  required: false
  description: Google Drive folder ID where company folders will be created
outputs:
- name: extraction_summary
  type: object
  source: generate_extraction_log.summary
  description: Summary of the extraction process
- name: extraction_log
  type: object
  source: generate_extraction_log.extraction_log
  description: Detailed extraction log with file information
- name: company_drive_folder_id
  type: string
  source: initialize_drive_handler.company_folder_id
  description: Google Drive folder ID for the company
- name: annual_reports_processed
  type: integer
  source: process_annual_reports.final_state.processed_count
  description: Number of annual reports processed
- name: concall_transcripts_processed
  type: integer
  source: process_concall_transcripts.final_state.processed_count
  description: Number of concall transcripts processed
version: '1.0'
description: Extracts annual reports and concall transcripts from screener.in and
  uploads to Google Drive
timeout_seconds: 3600

Execution ID	Status	Started	Duration	Actions
`3831182f...`	COMPLETED	2025-07-29 12:38:07	N/A	View
`244ffbd5...`	COMPLETED	2025-07-29 12:35:41	N/A	View
`2a4ec899...`	COMPLETED	2025-07-29 12:29:53	N/A	View
`d3ec646b...`	COMPLETED	2025-07-29 11:17:40	N/A	View