Case Study 20-2: Maya Monitors Job Boards for Consulting Opportunities

The Situation

Maya Reyes's consulting practice depends on a steady stream of new clients. New opportunities come in unpredictably — some months bring three strong leads, other months bring none. Her two primary sources for new work are word of mouth (which she cannot control) and job boards (which she can monitor).

Maya currently checks two job boards each morning: one general freelance platform and one specialized consulting-and-analytics marketplace. The manual process takes about fifteen minutes per day and is easy to forget on busy mornings. More importantly, on a freelance marketplace, good opportunities go fast — a posting might attract ten qualified bids within the first hour. Maya has missed opportunities by checking late.

Her goal: automate the daily check so that new opportunities matching her skills appear in a single CSV file each morning, ready for her review over coffee. She is not looking to auto-apply — she reviews every opportunity before responding — but she wants the filtering and aggregation done automatically.


The Skill Keywords

Maya's specialization is data analytics and business intelligence consulting. She is looking for opportunities that mention:

  • Data analytics / data analysis / business analytics
  • Business intelligence / BI
  • Python / pandas / SQL
  • Dashboard / reporting / visualization
  • Tableau / Power BI / Looker
  • Data strategy / KPI

She is NOT interested in:

  • Machine learning / AI / deep learning (not her specialty)
  • Software engineering / full-stack development
  • IT support / systems administration


Due Diligence

Platform 1: FreelanceHub (hypothetical)

robots.txt:

User-agent: *
Disallow: /private/
Disallow: /messages/
Disallow: /profile/edit/
Allow: /jobs/

Crawl-delay: 2

The public jobs listing pages are explicitly allowed. Terms of Service permit scraping public listings for personal use (finding opportunities for yourself is personal use).

Platform 2: AnalyticsGigs (hypothetical)

robots.txt:

User-agent: *
Disallow: /api/
Disallow: /search/
Allow: /listings/
Crawl-delay: 3

The /listings/ path is allowed; the /search/ path is not. Maya uses the direct listing URL rather than search results.
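
This due-diligence check can also be done programmatically with the standard library's urllib.robotparser — the same module the script uses below. A quick sketch against the AnalyticsGigs rules (the domain is hypothetical, and the rules are parsed from a local string rather than fetched over HTTP):

```python
import urllib.robotparser

# The AnalyticsGigs rules from above, parsed from a string instead of fetched
robots_lines = """\
User-agent: *
Disallow: /api/
Disallow: /search/
Allow: /listings/
Crawl-delay: 3
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_lines)

print(rp.can_fetch("*", "https://www.analyticsgigs.com/listings/consulting/"))  # True
print(rp.can_fetch("*", "https://www.analyticsgigs.com/search/?q=analytics"))   # False
print(rp.crawl_delay("*"))                                                      # 3
```

In production the script calls set_url() and read() to fetch the live robots.txt instead, but parsing a saved copy like this is a handy way to confirm you are reading the rules the same way the parser does.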


Inspecting the Page Structure

Maya opens FreelanceHub's job listings page and inspects the HTML (via browser Developer Tools → Inspect Element):

<div class="job-board">
  <article class="job-listing" data-job-id="87234" data-posted="2024-12-04">
    <div class="job-header">
      <h2 class="job-title">
        <a href="/jobs/87234/">Business Intelligence Dashboard Developer</a>
      </h2>
      <span class="client-name">Thornfield Media</span>
      <span class="posted-ago">Posted 3 hours ago</span>
    </div>
    <div class="job-details">
      <span class="budget-range">$75–$95/hour</span>
      <span class="duration">3–6 months</span>
      <span class="engagement-type">Contract</span>
    </div>
    <div class="job-description">
      <p>We need an experienced BI developer to build Tableau dashboards
      for our marketing analytics team. Python experience preferred.
      Must have strong SQL skills...</p>
    </div>
    <div class="job-tags">
      <span class="tag">Tableau</span>
      <span class="tag">SQL</span>
      <span class="tag">Python</span>
      <span class="tag">Business Intelligence</span>
    </div>
  </article>
  <!-- more listings... -->
</div>
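
Before committing selectors to the full script, they can be sanity-checked in isolation by pasting a saved snippet of this markup into BeautifulSoup (the HTML below is abbreviated from the listing above):

```python
from bs4 import BeautifulSoup

html = """
<article class="job-listing" data-job-id="87234" data-posted="2024-12-04">
  <h2 class="job-title"><a href="/jobs/87234/">Business Intelligence Dashboard Developer</a></h2>
  <div class="job-tags"><span class="tag">Tableau</span><span class="tag">SQL</span></div>
</article>
"""

soup = BeautifulSoup(html, "html.parser")  # stdlib parser; lxml also works
article = soup.select_one("article.job-listing")
link = article.select_one("h2.job-title a")

print(link.get_text(strip=True))                           # Business Intelligence Dashboard Developer
print(link["href"])                                        # /jobs/87234/
print(article["data-job-id"])                              # 87234
print([t.get_text() for t in article.select("span.tag")])  # ['Tableau', 'SQL']
```

Ten seconds of this in a REPL catches a mistyped class name far faster than re-running the whole monitor.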

The Script

"""
job_board_monitor.py
====================
Maya Reyes Consulting — Daily Job Opportunity Monitor

Scrapes job board listings, filters for relevant opportunities,
and saves matches to a CSV file. Run daily via cron or Task Scheduler.

Maya reviews the CSV each morning. No auto-applying.

Author: Maya Reyes
"""

import csv
import logging
import random
import re
import time
import urllib.robotparser
from dataclasses import dataclass, field
from datetime import date, datetime
from pathlib import Path
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

# ---------------------------------------------------------------------------
# Setup
# ---------------------------------------------------------------------------

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s — %(levelname)s — %(message)s",
)
logger = logging.getLogger(__name__)

USER_AGENT = (
    "MayaReyes-OpportunityMonitor/1.0 "
    "(Personal consulting opportunity tracker; "
    "contact maya@mayareyesconsulting.com)"
)

HEADERS = {
    "User-Agent": USER_AGENT,
    "Accept": "text/html,application/xhtml+xml",
}

OUTPUT_DIR = Path("output")
OPPORTUNITIES_CSV = OUTPUT_DIR / "consulting_opportunities.csv"
RUN_LOG_CSV = OUTPUT_DIR / "monitor_run_log.csv"

# ---------------------------------------------------------------------------
# Keywords: what to look for and what to exclude
# ---------------------------------------------------------------------------

INCLUDE_KEYWORDS = [
    "data analytics",
    "data analysis",
    "business analytics",
    "business intelligence",
    " bi ",         # surrounded by spaces to avoid partial matches
    "python",
    "pandas",
    " sql ",
    "dashboard",
    "reporting",
    "visualization",
    "tableau",
    "power bi",
    "looker",
    "data strategy",
    "kpi",
    "metrics",
    "data consulting",
]

EXCLUDE_KEYWORDS = [
    "machine learning",
    "deep learning",
    "neural network",
    "natural language processing",
    "nlp",
    "full-stack",
    "full stack",
    "web development",
    "devops",
    "it support",
    "systems administrator",
    "sysadmin",
]

# ---------------------------------------------------------------------------
# Job board configurations
# ---------------------------------------------------------------------------

JOB_BOARDS = [
    {
        "name": "FreelanceHub",
        "base_url": "https://www.freelancehub.com",
        "listings_url": "https://www.freelancehub.com/jobs/analytics/",
        "crawl_delay": 2,
        "listing_selector": "article.job-listing",
        "title_selector": "h2.job-title a",
        "client_selector": "span.client-name",
        "budget_selector": "span.budget-range",
        "duration_selector": "span.duration",
        "description_selector": "div.job-description p",
        "tags_selector": "div.job-tags span.tag",
        "date_attr": "data-posted",      # attribute on the article element
        "id_attr": "data-job-id",
    },
    {
        "name": "AnalyticsGigs",
        "base_url": "https://www.analyticsgigs.com",
        "listings_url": "https://www.analyticsgigs.com/listings/consulting/",
        "crawl_delay": 3,
        "listing_selector": "div.gig-card",
        "title_selector": "h3.gig-title a",
        "client_selector": "span.company",
        "budget_selector": "div.rate-info",
        "duration_selector": "span.project-length",
        "description_selector": "p.gig-summary",
        "tags_selector": "ul.skill-tags li",
        "date_attr": "data-date",
        "id_attr": "data-gig-id",
    },
]


# ---------------------------------------------------------------------------
# Data structure
# ---------------------------------------------------------------------------

@dataclass
class JobListing:
    """Represents a single scraped job listing."""
    board: str
    job_id: str
    title: str
    client: str
    budget: str
    duration: str
    description: str
    tags: list[str]
    posted_date: str
    job_url: str
    scraped_at: str = field(default_factory=lambda: datetime.now().isoformat())
    is_relevant: bool = False
    matched_keywords: list[str] = field(default_factory=list)


# ---------------------------------------------------------------------------
# robots.txt compliance
# ---------------------------------------------------------------------------

def is_scraping_allowed(url: str) -> bool:
    """Check robots.txt for the given URL."""
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    try:
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(robots_url)
        rp.read()
        allowed = rp.can_fetch(USER_AGENT, url)
        logger.info(
            f"robots.txt ({parsed.netloc}): "
            f"{'ALLOWED' if allowed else 'DISALLOWED'}"
        )
        return allowed
    except Exception as e:
        logger.warning(f"Could not read robots.txt: {e}")
        return True  # fail open: an unreadable robots.txt is treated as "no restrictions"


# ---------------------------------------------------------------------------
# Fetching
# ---------------------------------------------------------------------------

def fetch_page(url: str, crawl_delay: float) -> str | None:
    """Fetch a page, respecting the configured crawl delay."""
    delay = crawl_delay + random.uniform(0.5, 1.5)
    logger.debug(f"Waiting {delay:.1f}s before request...")
    time.sleep(delay)

    try:
        response = requests.get(url, headers=HEADERS, timeout=15)
        if response.status_code == 200:
            logger.info(f"Fetched: {url}")
            return response.text
        elif response.status_code == 429:
            logger.warning("Rate limited (429). Waiting 90s...")
            time.sleep(90)
            return None
        else:
            logger.error(f"HTTP {response.status_code}: {url}")
            return None
    except requests.RequestException as e:
        logger.error(f"Request failed: {e}")
        return None


# ---------------------------------------------------------------------------
# Extraction
# ---------------------------------------------------------------------------

def extract_listings(html: str, board: dict) -> list[JobListing]:
    """
    Extract all job listings from a board's HTML page.

    Uses the board-specific CSS selectors from the configuration dict.

    Args:
        html: Page HTML content.
        board: Job board configuration dict.

    Returns:
        List of JobListing objects.
    """
    soup = BeautifulSoup(html, "lxml")
    listings = []

    article_elements = soup.select(board["listing_selector"])
    if not article_elements:
        logger.warning(
            f"No listing elements found on {board['name']} "
            f"(selector: '{board['listing_selector']}'). "
            "Site structure may have changed."
        )
        return []

    logger.info(
        f"Found {len(article_elements)} listing element(s) on {board['name']}."
    )

    for element in article_elements:
        try:
            # Helper: get text from a selector within this element
            def get_text(selector: str) -> str:
                tag = element.select_one(selector)
                return tag.get_text(strip=True) if tag else ""

            # Title and URL
            title_tag = element.select_one(board["title_selector"])
            title = title_tag.get_text(strip=True) if title_tag else ""
            relative_url = title_tag.get("href", "") if title_tag else ""
            job_url = urljoin(board["base_url"], relative_url)

            # Other fields
            client = get_text(board["client_selector"])
            budget = get_text(board["budget_selector"])
            duration = get_text(board["duration_selector"])
            description = get_text(board["description_selector"])

            # Tags — multiple elements
            tag_elements = element.select(board["tags_selector"])
            tags = [t.get_text(strip=True) for t in tag_elements]

            # Job ID and posted date from data attributes
            job_id = element.get(board["id_attr"], "")
            posted_date = element.get(board["date_attr"], "")

            if not title:
                continue  # Skip malformed elements

            listings.append(JobListing(
                board=board["name"],
                job_id=job_id,
                title=title,
                client=client,
                budget=budget,
                duration=duration,
                description=description,
                tags=tags,
                posted_date=posted_date,
                job_url=job_url,
            ))

        except Exception as e:
            logger.warning(f"Error extracting listing: {e}")
            continue

    return listings


# ---------------------------------------------------------------------------
# Keyword filtering
# ---------------------------------------------------------------------------

def check_relevance(listing: JobListing) -> tuple[bool, list[str]]:
    """
    Check whether a job listing is relevant to Maya's specialization.

    Combines title, description, and tags for matching.
    Excludes listings that match exclude keywords even if include keywords match.

    Args:
        listing: A JobListing to evaluate.

    Returns:
        Tuple of (is_relevant: bool, matched_keywords: list[str])
    """
    # Build full search text
    searchable_text = " ".join([
        listing.title,
        listing.description,
        " ".join(listing.tags),
    ]).lower()

    # Pad with spaces to help with partial word avoidance
    searchable_text = f" {searchable_text} "

    # Check exclusion keywords first
    for exclude_kw in EXCLUDE_KEYWORDS:
        if exclude_kw in searchable_text:
            return False, []

    # Check inclusion keywords
    matched = []
    for include_kw in INCLUDE_KEYWORDS:
        if include_kw in searchable_text:
            matched.append(include_kw.strip())

    is_relevant = len(matched) > 0
    return is_relevant, matched


def filter_listings(listings: list[JobListing]) -> list[JobListing]:
    """
    Apply keyword filters to a list of job listings.
    Updates each listing in place and returns only the relevant ones.
    """
    relevant = []
    for listing in listings:
        is_relevant, matched = check_relevance(listing)
        listing.is_relevant = is_relevant
        listing.matched_keywords = matched
        if is_relevant:
            relevant.append(listing)
    return relevant


# ---------------------------------------------------------------------------
# Deduplication: skip listings already saved
# ---------------------------------------------------------------------------

def load_seen_job_ids(csv_path: Path) -> set[str]:
    """
    Load job IDs already recorded in the output CSV.
    Used to avoid saving duplicate listings on repeated runs.
    """
    seen = set()
    if not csv_path.exists():
        return seen
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        for row in reader:
            key = f"{row.get('board', '')}_{row.get('job_id', '')}"
            seen.add(key)
    return seen


# ---------------------------------------------------------------------------
# CSV output
# ---------------------------------------------------------------------------

FIELDNAMES = [
    "board",
    "job_id",
    "title",
    "client",
    "budget",
    "duration",
    "posted_date",
    "matched_keywords",
    "tags",
    "description",
    "job_url",
    "scraped_at",
]


def save_new_opportunities(
    listings: list[JobListing],
    seen_ids: set[str],
    output_path: Path,
) -> int:
    """
    Save new (not previously recorded) job listings to CSV.

    Args:
        listings: Relevant listings to potentially save.
        seen_ids: Set of job IDs already in the CSV.
        output_path: Output CSV file path.

    Returns:
        Number of new listings saved.
    """
    output_path.parent.mkdir(parents=True, exist_ok=True)
    file_exists = output_path.exists()
    new_count = 0

    with open(output_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES, extrasaction="ignore")
        if not file_exists:
            writer.writeheader()

        for listing in listings:
            uid = f"{listing.board}_{listing.job_id}"
            if uid in seen_ids:
                logger.debug(f"Skipping duplicate: {uid}")
                continue

            writer.writerow({
                "board": listing.board,
                "job_id": listing.job_id,
                "title": listing.title,
                "client": listing.client,
                "budget": listing.budget,
                "duration": listing.duration,
                "posted_date": listing.posted_date,
                "matched_keywords": ", ".join(listing.matched_keywords),
                "tags": ", ".join(listing.tags),
                "description": listing.description[:500],  # truncate long descriptions
                "job_url": listing.job_url,
                "scraped_at": listing.scraped_at,
            })
            new_count += 1
            seen_ids.add(uid)  # prevent duplicates within this run

    return new_count


# ---------------------------------------------------------------------------
# Run logging
# ---------------------------------------------------------------------------

def log_run(boards_checked: int, total_found: int, relevant: int, new_saved: int) -> None:
    """Record metadata about each monitoring run for auditing."""
    RUN_LOG_CSV.parent.mkdir(parents=True, exist_ok=True)
    file_exists = RUN_LOG_CSV.exists()

    with open(RUN_LOG_CSV, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if not file_exists:
            writer.writerow([
                "run_datetime", "boards_checked",
                "total_listings_found", "relevant_listings", "new_saved"
            ])
        writer.writerow([
            datetime.now().isoformat(),
            boards_checked,
            total_found,
            relevant,
            new_saved,
        ])


# ---------------------------------------------------------------------------
# Main pipeline
# ---------------------------------------------------------------------------

def run_job_monitor() -> None:
    """Run the daily job board monitoring pipeline."""
    today = date.today().strftime("%A, %B %d, %Y")
    print("Maya Reyes Consulting — Job Opportunity Monitor")
    print(f"Date: {today}")
    print()

    # Load already-seen job IDs (deduplication)
    seen_ids = load_seen_job_ids(OPPORTUNITIES_CSV)
    logger.info(f"Loaded {len(seen_ids)} previously seen job ID(s).")

    all_listings = []
    boards_checked = 0

    for board in JOB_BOARDS:
        print(f"Checking {board['name']}...")

        if not is_scraping_allowed(board["listings_url"]):
            print("  SKIPPED: robots.txt disallows access.")
            continue

        html = fetch_page(board["listings_url"], board["crawl_delay"])
        if not html:
            print("  FAILED: Could not fetch listings page.")
            continue

        listings = extract_listings(html, board)
        print(f"  Found {len(listings)} listing(s).")
        all_listings.extend(listings)
        boards_checked += 1

    # Filter for relevant opportunities
    relevant_listings = filter_listings(all_listings)
    print("\nKeyword filtering:")
    print(f"  Total listings scraped:  {len(all_listings)}")
    print(f"  Relevant (matching):     {len(relevant_listings)}")
    print(f"  Excluded (not matching): {len(all_listings) - len(relevant_listings)}")

    # Save new opportunities
    new_saved = save_new_opportunities(relevant_listings, seen_ids, OPPORTUNITIES_CSV)

    # Log this run
    log_run(
        boards_checked=boards_checked,
        total_found=len(all_listings),
        relevant=len(relevant_listings),
        new_saved=new_saved,
    )

    # Print summary
    print()
    print("=" * 50)
    print("DAILY MONITOR SUMMARY")
    print(f"  Boards checked:    {boards_checked}")
    print(f"  Listings scraped:  {len(all_listings)}")
    print(f"  Relevant to Maya:  {len(relevant_listings)}")
    print(f"  New (not seen):    {new_saved}")
    print(f"  Output file:       {OPPORTUNITIES_CSV.resolve()}")
    print("=" * 50)

    if relevant_listings:
        print()
        print("NEW OPPORTUNITIES TODAY:")
        # seen_ids was updated during save_new_opportunities(), so re-checking
        # IDs here would flag everything as already seen. Report the count the
        # save step returned, then show a preview of the relevant listings.
        print(f"  {new_saved} new listing(s) saved to CSV.")
        for listing in relevant_listings[:5]:  # show first 5
            print(f"\n  [{listing.board}] {listing.title}")
            print(f"  Client: {listing.client}")
            print(f"  Budget: {listing.budget} | Duration: {listing.duration}")
            print(f"  Keywords: {', '.join(listing.matched_keywords)}")
            print(f"  URL: {listing.job_url}")
    else:
        print("\nNo relevant opportunities found today.")
        print("Check back tomorrow, or adjust your keyword list.")


if __name__ == "__main__":
    run_job_monitor()

Sample Output

When Maya runs the script on a typical morning:

Maya Reyes Consulting — Job Opportunity Monitor
Date: Thursday, December 05, 2024

Checking FreelanceHub...
  Found 24 listing(s).
Checking AnalyticsGigs...
  Found 18 listing(s).

Keyword filtering:
  Total listings scraped:  42
  Relevant (matching):     7
  Excluded (not matching): 35

==================================================
DAILY MONITOR SUMMARY
  Boards checked:    2
  Listings scraped:  42
  Relevant to Maya:  7
  New (not seen):    5
  Output file:       /Users/maya/output/consulting_opportunities.csv
==================================================

NEW OPPORTUNITIES TODAY:

  [FreelanceHub] Business Intelligence Dashboard Developer
  Client: Thornfield Media
  Budget: $75–$95/hour | Duration: 3–6 months
  Keywords: business intelligence, tableau, python, sql
  URL: https://www.freelancehub.com/jobs/87234/

  [AnalyticsGigs] Data Analytics Consultant — E-commerce
  Client: Novak Digital
  Budget: $80–$100/hour | Duration: 2–4 months
  Keywords: data analytics, python, pandas, kpi, reporting
  URL: https://www.analyticsgigs.com/listings/87652/

  [FreelanceHub] KPI Reporting Framework — Financial Services
  Client: [Confidential]
  Budget: $90/hour | Duration: Ongoing
  Keywords: kpi, sql, dashboard, business analytics
  URL: https://www.freelancehub.com/jobs/87198/

Maya's Weekly Workflow

The script runs automatically each morning at 7:00 AM via her Mac's launchd scheduler (similar to cron). Her morning routine:

  1. Open consulting_opportunities.csv in Excel or Numbers
  2. Filter by scraped_at = today's date to see only new listings
  3. Review the 3–8 relevant listings (typically)
  4. Mark the ones worth pursuing and visit their URLs directly
  5. Apply through the platform's normal process

Time per day: About ten minutes of review instead of fifteen minutes of browsing plus filtering. The real saving is consistency — she checks every single weekday now, because the script does the tedious part automatically.
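
Step 2 of the routine can itself be scripted if Maya prefers the terminal to a spreadsheet. A sketch using only the standard library (it assumes the CSV layout the monitor produces; the path is illustrative):

```python
import csv
from datetime import date
from pathlib import Path

def todays_rows(csv_path: str) -> list[dict]:
    """Return CSV rows whose scraped_at timestamp falls on today's date."""
    today = date.today().isoformat()  # ISO timestamps start with the date
    with open(csv_path, newline="", encoding="utf-8") as f:
        return [row for row in csv.DictReader(f)
                if row["scraped_at"].startswith(today)]

csv_path = "output/consulting_opportunities.csv"  # path used by the monitor
if Path(csv_path).exists():
    for row in todays_rows(csv_path):
        print(f"[{row['board']}] {row['title']} — {row['budget']}")
```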


Key Lessons from This Case Study

Keyword filtering multiplies the value. Without filtering, Maya sees roughly 40 listings per day and must evaluate each one manually. With filtering, she sees 5–8 pre-qualified opportunities. The script rejects the ~35 irrelevant listings automatically, freeing Maya to focus on the ones that actually matter.
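
One refinement worth noting: the space-padded keywords in INCLUDE_KEYWORDS (" bi ", " sql ") will miss a keyword that sits against punctuation, as in "strong SQL skills." or "Power BI/Tableau". A regex word-boundary check avoids that. A sketch of a possible drop-in alternative (not part of Maya's script as written):

```python
import re

def keyword_matches(text: str, keywords: list[str]) -> list[str]:
    """Return keywords found as whole words/phrases, even next to punctuation."""
    found = []
    for kw in keywords:
        # \b matches a word boundary, so "sql" matches "SQL." but not "mysql"
        pattern = r"\b" + re.escape(kw.strip()) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(kw.strip())
    return found

print(keyword_matches("Strong SQL skills; Power BI/Tableau a plus.",
                      ["sql", "bi", "tableau", "mysql"]))
# ['sql', 'bi', 'tableau']
```

With this approach the keywords no longer need their space padding, and re.escape keeps multi-word phrases like "power bi" working unchanged.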

Deduplication is essential for daily scrapers. Without it, every day's run would add the same listings to the CSV. The seen_ids set prevents this — a listing scraped Monday will not appear again on Tuesday if the position is still open.

Configurations change; code should not. When a job board updates its HTML structure (which happens), Maya updates only the CSS selectors in the JOB_BOARDS configuration list, not the extraction logic. The separation of configuration from code makes maintenance straightforward.

Run logging for accountability. The monitor_run_log.csv file records every run — how many listings were found, how many were relevant, how many were new. If the script starts returning zero listings consistently, that log tells Maya something might be wrong with the page structure.
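
That health check can itself be automated. A sketch of a helper that flags a run of consecutive zero-listing days (the three-run window is an arbitrary choice, not something Maya's script includes):

```python
import csv
from pathlib import Path

def recent_runs_look_healthy(log_path: Path, window: int = 3) -> bool:
    """Return False when the last `window` runs all found zero listings."""
    if not log_path.exists():
        return True  # no history yet, nothing to flag
    with open(log_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    recent = rows[-window:]
    if len(recent) < window:
        return True  # not enough history for a trend
    return any(int(r["total_listings_found"]) > 0 for r in recent)

if not recent_runs_look_healthy(Path("output/monitor_run_log.csv")):
    print("WARNING: recent runs found 0 listings — the page structure may have changed.")
```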

Scheduling is outside Python. The script itself is just logic. Scheduling it to run daily is a system task — cron or launchd on macOS/Linux, Task Scheduler on Windows, or a cloud scheduler for something that needs to run on a server. Python's job is to do the work well; the scheduler's job is to invoke it at the right time.
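
For example, a crontab entry for weekday mornings might look like this (the paths are illustrative; on macOS a launchd plist achieves the same effect):

```shell
# Run the monitor at 7:00 AM, Monday through Friday
0 7 * * 1-5 cd /Users/maya/job-monitor && /usr/bin/python3 job_board_monitor.py >> monitor.log 2>&1
```

Redirecting output to a log file means failed runs leave a trace Maya can inspect, rather than disappearing silently.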