Case Study 20-2: Maya Monitors Job Boards for Consulting Opportunities
The Situation
Maya Reyes's consulting practice depends on a steady stream of new clients. New opportunities come in unpredictably — some months bring three strong leads, other months bring none. Her two primary sources for new work are word of mouth (which she cannot control) and job boards (which she can monitor).
Maya currently checks two job boards each morning: one general freelance platform and one specialized consulting-and-analytics marketplace. The manual process takes about fifteen minutes per day and is easy to forget on busy mornings. More importantly, on a freelance marketplace, good opportunities go fast — a posting might attract ten qualified bids within the first hour. Maya has missed opportunities by checking late.
Her goal: automate the daily check so that new opportunities matching her skills appear in a single CSV file each morning, ready for her review over coffee. She is not looking to auto-apply — she reviews every opportunity before responding — but she wants the filtering and aggregation done automatically.
The Skill Keywords
Maya's specialization is data analytics and business intelligence consulting. She is looking for opportunities that mention:
- Data analytics / data analysis / business analytics
- Business intelligence / BI
- Python / pandas / SQL
- Dashboard / reporting / visualization
- Tableau / Power BI / Looker
- Data strategy / KPI
She is NOT interested in:
- Machine learning / AI / deep learning (not her specialty)
- Software engineering / full-stack development
- IT support / systems administration
Due Diligence
Platform 1: FreelanceHub (hypothetical)
robots.txt:
User-agent: *
Disallow: /private/
Disallow: /messages/
Disallow: /profile/edit/
Allow: /jobs/
Crawl-delay: 2
The public jobs listing pages are explicitly allowed. Terms of Service permit scraping public listings for personal use (finding opportunities for yourself is personal use).
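These rules can be double-checked in code before any page is fetched. A minimal sketch using Python's standard-library `urllib.robotparser`, feeding it the FreelanceHub rules shown above (the `/private/inbox/` path is a hypothetical example of a disallowed URL):

```python
import urllib.robotparser

# Feed FreelanceHub's published rules straight into the stdlib parser.
rp = urllib.robotparser.RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /private/
Disallow: /messages/
Disallow: /profile/edit/
Allow: /jobs/
Crawl-delay: 2
""".splitlines())

print(rp.can_fetch("*", "https://www.freelancehub.com/jobs/analytics/"))  # True
print(rp.can_fetch("*", "https://www.freelancehub.com/private/inbox/"))   # False
print(rp.crawl_delay("*"))  # 2
```

The parser applies the first matching rule, so `/jobs/analytics/` is allowed while anything under `/private/` is not, and `crawl_delay()` surfaces the 2-second delay the script must honor between requests.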
Platform 2: AnalyticsGigs (hypothetical)
robots.txt:
User-agent: *
Disallow: /api/
Disallow: /search/
Allow: /listings/
Crawl-delay: 3
The /listings/ path is allowed; the /search/ path is not. Maya uses the direct listing URL rather than search results.
Inspecting the Page Structure
Maya opens FreelanceHub's job listings page and inspects the HTML (via browser Developer Tools → Inspect Element):
<div class="job-board">
<article class="job-listing" data-job-id="87234" data-posted="2024-12-04">
<div class="job-header">
<h2 class="job-title">
<a href="/jobs/87234/">Business Intelligence Dashboard Developer</a>
</h2>
<span class="client-name">Thornfield Media</span>
<span class="posted-ago">Posted 3 hours ago</span>
</div>
<div class="job-details">
<span class="budget-range">$75–$95/hour</span>
<span class="duration">3–6 months</span>
<span class="engagement-type">Contract</span>
</div>
<div class="job-description">
<p>We need an experienced BI developer to build Tableau dashboards
for our marketing analytics team. Python experience preferred.
Must have strong SQL skills...</p>
</div>
<div class="job-tags">
<span class="tag">Tableau</span>
<span class="tag">SQL</span>
<span class="tag">Python</span>
<span class="tag">Business Intelligence</span>
</div>
</article>
<!-- more listings... -->
</div>
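Before committing selectors to a configuration, they can be sanity-checked against a saved copy of this snippet. A small sketch (the HTML above is abbreviated here, and `html.parser` is used so the check does not depend on lxml):

```python
from bs4 import BeautifulSoup

# Abbreviated version of the listing markup from the inspection above.
html = """
<article class="job-listing" data-job-id="87234" data-posted="2024-12-04">
  <h2 class="job-title"><a href="/jobs/87234/">Business Intelligence Dashboard Developer</a></h2>
  <span class="client-name">Thornfield Media</span>
  <div class="job-tags">
    <span class="tag">Tableau</span>
    <span class="tag">SQL</span>
  </div>
</article>
"""

soup = BeautifulSoup(html, "html.parser")
listing = soup.select_one("article.job-listing")
title = listing.select_one("h2.job-title a").get_text(strip=True)
tags = [t.get_text(strip=True) for t in listing.select("div.job-tags span.tag")]
print(listing["data-job-id"], title, tags)
```

If a selector prints nothing here, it will find nothing in the full script either, so this five-minute check catches typos before the first real run.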
The Script
"""
job_board_monitor.py
====================
Maya Reyes Consulting — Daily Job Opportunity Monitor
Scrapes job board listings, filters for relevant opportunities,
and saves matches to a CSV file. Run daily via cron or Task Scheduler.
Maya reviews the CSV each morning. No auto-applying.
Author: Maya Reyes
"""
import csv
import logging
import random
import re
import time
import urllib.robotparser
from dataclasses import dataclass, field
from datetime import date, datetime
from pathlib import Path
from urllib.parse import urljoin, urlparse
import requests
from bs4 import BeautifulSoup
# ---------------------------------------------------------------------------
# Setup
# ---------------------------------------------------------------------------
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s — %(levelname)s — %(message)s",
)
logger = logging.getLogger(__name__)
USER_AGENT = (
"MayaReyes-OpportunityMonitor/1.0 "
"(Personal consulting opportunity tracker; "
"contact maya@mayareyesconsulting.com)"
)
HEADERS = {
"User-Agent": USER_AGENT,
"Accept": "text/html,application/xhtml+xml",
}
OUTPUT_DIR = Path("output")
OPPORTUNITIES_CSV = OUTPUT_DIR / "consulting_opportunities.csv"
RUN_LOG_CSV = OUTPUT_DIR / "monitor_run_log.csv"
# ---------------------------------------------------------------------------
# Keywords: what to look for and what to exclude
# ---------------------------------------------------------------------------
INCLUDE_KEYWORDS = [
"data analytics",
"data analysis",
"business analytics",
"business intelligence",
" bi ", # surrounded by spaces to avoid partial matches
"python",
"pandas",
" sql ",
"dashboard",
"reporting",
"visualization",
"tableau",
"power bi",
"looker",
"data strategy",
"kpi",
"metrics",
"data consulting",
]
EXCLUDE_KEYWORDS = [
"machine learning",
"deep learning",
"neural network",
"natural language processing",
"nlp",
"full-stack",
"full stack",
"web development",
"devops",
"it support",
"systems administrator",
"sysadmin",
]
# ---------------------------------------------------------------------------
# Job board configurations
# ---------------------------------------------------------------------------
JOB_BOARDS = [
{
"name": "FreelanceHub",
"base_url": "https://www.freelancehub.com",
"listings_url": "https://www.freelancehub.com/jobs/analytics/",
"crawl_delay": 2,
"listing_selector": "article.job-listing",
"title_selector": "h2.job-title a",
"client_selector": "span.client-name",
"budget_selector": "span.budget-range",
"duration_selector": "span.duration",
"description_selector": "div.job-description p",
"tags_selector": "div.job-tags span.tag",
"date_attr": "data-posted", # attribute on the article element
"id_attr": "data-job-id",
},
{
"name": "AnalyticsGigs",
"base_url": "https://www.analyticsgigs.com",
"listings_url": "https://www.analyticsgigs.com/listings/consulting/",
"crawl_delay": 3,
"listing_selector": "div.gig-card",
"title_selector": "h3.gig-title a",
"client_selector": "span.company",
"budget_selector": "div.rate-info",
"duration_selector": "span.project-length",
"description_selector": "p.gig-summary",
"tags_selector": "ul.skill-tags li",
"date_attr": "data-date",
"id_attr": "data-gig-id",
},
]
# ---------------------------------------------------------------------------
# Data structure
# ---------------------------------------------------------------------------
@dataclass
class JobListing:
"""Represents a single scraped job listing."""
board: str
job_id: str
title: str
client: str
budget: str
duration: str
description: str
tags: list[str]
posted_date: str
job_url: str
scraped_at: str = field(default_factory=lambda: datetime.now().isoformat())
is_relevant: bool = False
matched_keywords: list[str] = field(default_factory=list)
# ---------------------------------------------------------------------------
# robots.txt compliance
# ---------------------------------------------------------------------------
def is_scraping_allowed(url: str) -> bool:
"""Check robots.txt for the given URL."""
parsed = urlparse(url)
robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
try:
rp = urllib.robotparser.RobotFileParser()
rp.set_url(robots_url)
rp.read()
allowed = rp.can_fetch(USER_AGENT, url)
logger.info(
f"robots.txt ({parsed.netloc}): "
f"{'ALLOWED' if allowed else 'DISALLOWED'}"
)
return allowed
    except Exception as e:
        # Fail open: if robots.txt cannot be read at all, assume access is allowed.
        logger.warning(f"Could not read robots.txt: {e}")
        return True
# ---------------------------------------------------------------------------
# Fetching
# ---------------------------------------------------------------------------
def fetch_page(url: str, crawl_delay: float) -> str | None:
"""Fetch a page, respecting the configured crawl delay."""
delay = crawl_delay + random.uniform(0.5, 1.5)
logger.debug(f"Waiting {delay:.1f}s before request...")
time.sleep(delay)
try:
response = requests.get(url, headers=HEADERS, timeout=15)
if response.status_code == 200:
logger.info(f"Fetched: {url}")
return response.text
elif response.status_code == 429:
logger.warning("Rate limited (429). Waiting 90s...")
time.sleep(90)
return None
else:
logger.error(f"HTTP {response.status_code}: {url}")
return None
except requests.RequestException as e:
logger.error(f"Request failed: {e}")
return None
# ---------------------------------------------------------------------------
# Extraction
# ---------------------------------------------------------------------------
def extract_listings(html: str, board: dict) -> list[JobListing]:
"""
Extract all job listings from a board's HTML page.
Uses the board-specific CSS selectors from the configuration dict.
Args:
html: Page HTML content.
board: Job board configuration dict.
Returns:
List of JobListing objects.
"""
soup = BeautifulSoup(html, "lxml")
listings = []
article_elements = soup.select(board["listing_selector"])
if not article_elements:
logger.warning(
f"No listing elements found on {board['name']} "
f"(selector: '{board['listing_selector']}'). "
"Site structure may have changed."
)
return []
logger.info(
f"Found {len(article_elements)} listing element(s) on {board['name']}."
)
for element in article_elements:
try:
# Helper: get text from a selector within this element
def get_text(selector: str) -> str:
tag = element.select_one(selector)
return tag.get_text(strip=True) if tag else ""
# Title and URL
title_tag = element.select_one(board["title_selector"])
title = title_tag.get_text(strip=True) if title_tag else ""
relative_url = title_tag.get("href", "") if title_tag else ""
job_url = urljoin(board["base_url"], relative_url)
# Other fields
client = get_text(board["client_selector"])
budget = get_text(board["budget_selector"])
duration = get_text(board["duration_selector"])
description = get_text(board["description_selector"])
# Tags — multiple elements
tag_elements = element.select(board["tags_selector"])
tags = [t.get_text(strip=True) for t in tag_elements]
# Job ID and posted date from data attributes
job_id = element.get(board["id_attr"], "")
posted_date = element.get(board["date_attr"], "")
if not title:
continue # Skip malformed elements
listings.append(JobListing(
board=board["name"],
job_id=job_id,
title=title,
client=client,
budget=budget,
duration=duration,
description=description,
tags=tags,
posted_date=posted_date,
job_url=job_url,
))
except Exception as e:
logger.warning(f"Error extracting listing: {e}")
continue
return listings
# ---------------------------------------------------------------------------
# Keyword filtering
# ---------------------------------------------------------------------------
def check_relevance(listing: JobListing) -> tuple[bool, list[str]]:
"""
Check whether a job listing is relevant to Maya's specialization.
Combines title, description, and tags for matching.
Excludes listings that match exclude keywords even if include keywords match.
Args:
listing: A JobListing to evaluate.
Returns:
Tuple of (is_relevant: bool, matched_keywords: list[str])
"""
# Build full search text
searchable_text = " ".join([
listing.title,
listing.description,
" ".join(listing.tags),
]).lower()
# Pad with spaces to help with partial word avoidance
searchable_text = f" {searchable_text} "
# Check exclusion keywords first
for exclude_kw in EXCLUDE_KEYWORDS:
if exclude_kw in searchable_text:
return False, []
# Check inclusion keywords
matched = []
for include_kw in INCLUDE_KEYWORDS:
if include_kw in searchable_text:
matched.append(include_kw.strip())
is_relevant = len(matched) > 0
return is_relevant, matched
def filter_listings(listings: list[JobListing]) -> list[JobListing]:
"""
Apply keyword filters to a list of job listings.
Updates each listing in place and returns only the relevant ones.
"""
relevant = []
for listing in listings:
is_relevant, matched = check_relevance(listing)
listing.is_relevant = is_relevant
listing.matched_keywords = matched
if is_relevant:
relevant.append(listing)
return relevant
# ---------------------------------------------------------------------------
# Deduplication: skip listings already saved
# ---------------------------------------------------------------------------
def load_seen_job_ids(csv_path: Path) -> set[str]:
"""
Load job IDs already recorded in the output CSV.
Used to avoid saving duplicate listings on repeated runs.
"""
seen = set()
if not csv_path.exists():
return seen
with open(csv_path, newline="", encoding="utf-8") as f:
reader = csv.DictReader(f)
for row in reader:
key = f"{row.get('board', '')}_{row.get('job_id', '')}"
seen.add(key)
return seen
# ---------------------------------------------------------------------------
# CSV output
# ---------------------------------------------------------------------------
FIELDNAMES = [
"board",
"job_id",
"title",
"client",
"budget",
"duration",
"posted_date",
"matched_keywords",
"tags",
"description",
"job_url",
"scraped_at",
]
def save_new_opportunities(
listings: list[JobListing],
seen_ids: set[str],
output_path: Path,
) -> int:
"""
Save new (not previously recorded) job listings to CSV.
Args:
listings: Relevant listings to potentially save.
seen_ids: Set of job IDs already in the CSV.
output_path: Output CSV file path.
Returns:
Number of new listings saved.
"""
output_path.parent.mkdir(parents=True, exist_ok=True)
file_exists = output_path.exists()
new_count = 0
with open(output_path, "a", newline="", encoding="utf-8") as f:
writer = csv.DictWriter(f, fieldnames=FIELDNAMES, extrasaction="ignore")
if not file_exists:
writer.writeheader()
for listing in listings:
uid = f"{listing.board}_{listing.job_id}"
if uid in seen_ids:
logger.debug(f"Skipping duplicate: {uid}")
continue
writer.writerow({
"board": listing.board,
"job_id": listing.job_id,
"title": listing.title,
"client": listing.client,
"budget": listing.budget,
"duration": listing.duration,
"posted_date": listing.posted_date,
"matched_keywords": ", ".join(listing.matched_keywords),
"tags": ", ".join(listing.tags),
"description": listing.description[:500], # truncate long descriptions
"job_url": listing.job_url,
"scraped_at": listing.scraped_at,
})
new_count += 1
seen_ids.add(uid) # prevent duplicates within this run
return new_count
# ---------------------------------------------------------------------------
# Run logging
# ---------------------------------------------------------------------------
def log_run(boards_checked: int, total_found: int, relevant: int, new_saved: int) -> None:
"""Record metadata about each monitoring run for auditing."""
RUN_LOG_CSV.parent.mkdir(parents=True, exist_ok=True)
file_exists = RUN_LOG_CSV.exists()
with open(RUN_LOG_CSV, "a", newline="", encoding="utf-8") as f:
writer = csv.writer(f)
if not file_exists:
writer.writerow([
"run_datetime", "boards_checked",
"total_listings_found", "relevant_listings", "new_saved"
])
writer.writerow([
datetime.now().isoformat(),
boards_checked,
total_found,
relevant,
new_saved,
])
# ---------------------------------------------------------------------------
# Main pipeline
# ---------------------------------------------------------------------------
def run_job_monitor() -> None:
"""Run the daily job board monitoring pipeline."""
today = date.today().strftime("%A, %B %d, %Y")
print(f"Maya Reyes Consulting — Job Opportunity Monitor")
print(f"Date: {today}")
print()
# Load already-seen job IDs (deduplication)
seen_ids = load_seen_job_ids(OPPORTUNITIES_CSV)
logger.info(f"Loaded {len(seen_ids)} previously seen job ID(s).")
all_listings = []
boards_checked = 0
for board in JOB_BOARDS:
print(f"Checking {board['name']}...")
if not is_scraping_allowed(board["listings_url"]):
print(f" SKIPPED: robots.txt disallows access.")
continue
html = fetch_page(board["listings_url"], board["crawl_delay"])
if not html:
print(f" FAILED: Could not fetch listings page.")
continue
listings = extract_listings(html, board)
print(f" Found {len(listings)} listing(s).")
all_listings.extend(listings)
boards_checked += 1
# Filter for relevant opportunities
relevant_listings = filter_listings(all_listings)
print(f"\nKeyword filtering:")
print(f" Total listings scraped: {len(all_listings)}")
print(f" Relevant (matching): {len(relevant_listings)}")
print(f" Excluded (not matching): {len(all_listings) - len(relevant_listings)}")
# Save new opportunities
new_saved = save_new_opportunities(relevant_listings, seen_ids, OPPORTUNITIES_CSV)
# Log this run
log_run(
boards_checked=boards_checked,
total_found=len(all_listings),
relevant=len(relevant_listings),
new_saved=new_saved,
)
# Print summary
print()
print("=" * 50)
print("DAILY MONITOR SUMMARY")
print(f" Boards checked: {boards_checked}")
print(f" Listings scraped: {len(all_listings)}")
print(f" Relevant to Maya: {len(relevant_listings)}")
print(f" New (not seen): {new_saved}")
print(f" Output file: {OPPORTUNITIES_CSV.resolve()}")
print("=" * 50)
    if relevant_listings:
        print()
        print("NEW OPPORTUNITIES TODAY:")
        for listing in relevant_listings[:5]:  # show up to the first 5
            print(f"\n  [{listing.board}] {listing.title}")
            print(f"    Client: {listing.client}")
            print(f"    Budget: {listing.budget} | Duration: {listing.duration}")
            print(f"    Keywords: {', '.join(listing.matched_keywords)}")
            print(f"    URL: {listing.job_url}")
else:
print("\nNo relevant opportunities found today.")
print("Check back tomorrow, or adjust your keyword list.")
if __name__ == "__main__":
run_job_monitor()
Sample Output
When Maya runs the script on a typical morning:
Maya Reyes Consulting — Job Opportunity Monitor
Date: Thursday, December 05, 2024
Checking FreelanceHub...
Found 24 listing(s).
Checking AnalyticsGigs...
Found 18 listing(s).
Keyword filtering:
Total listings scraped: 42
Relevant (matching): 7
Excluded (not matching): 35
==================================================
DAILY MONITOR SUMMARY
Boards checked: 2
Listings scraped: 42
Relevant to Maya: 7
New (not seen): 5
Output file: /Users/maya/output/consulting_opportunities.csv
==================================================
NEW OPPORTUNITIES TODAY:
[FreelanceHub] Business Intelligence Dashboard Developer
Client: Thornfield Media
Budget: $75–$95/hour | Duration: 3–6 months
Keywords: business intelligence, tableau, python, sql
URL: https://www.freelancehub.com/jobs/87234/
[AnalyticsGigs] Data Analytics Consultant — E-commerce
Client: Novak Digital
Budget: $80–$100/hour | Duration: 2–4 months
Keywords: data analytics, python, pandas, kpi, reporting
URL: https://www.analyticsgigs.com/listings/87652/
[FreelanceHub] KPI Reporting Framework — Financial Services
Client: [Confidential]
Budget: $90/hour | Duration: Ongoing
Keywords: kpi, sql, dashboard, business analytics
URL: https://www.freelancehub.com/jobs/87198/
Maya's Weekly Workflow
The script runs automatically each morning at 7:00 AM via her Mac's launchd scheduler (similar to cron). Her morning routine:
- Open consulting_opportunities.csv in Excel or Numbers
- Filter by scraped_at = today's date to see only new listings
- Review the 3–8 relevant listings (typically)
- Mark the ones worth pursuing and visit their URLs directly
- Apply through the platform's normal process
Time per day: About ten minutes of review instead of fifteen minutes of browsing plus filtering. The real saving is consistency — she checks every single weekday now, because the script does the tedious part automatically.
Key Lessons from This Case Study
Keyword filtering multiplies the value. Without filtering, Maya sees 40 listings per day and must manually evaluate each. With filtering, she sees 5–8 pre-qualified opportunities. The script makes 35 irrelevant decisions automatically, freeing Maya to focus on the ones that actually matter.
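One caveat worth knowing: the script's substring matching pads keywords like " bi " and " sql " with spaces, which avoids matching inside longer words but misses occurrences followed by punctuation ("strong SQL." or "(BI)"). A sketch of a stricter word-boundary variant; this is an alternative approach, not what Maya's script above uses:

```python
import re

def keyword_hits(text: str, keywords: list[str]) -> list[str]:
    """Match keywords as whole words/phrases, so 'bi' never matches inside 'bids'."""
    text = text.lower()
    hits = []
    for kw in keywords:
        # \b asserts a word boundary, so punctuation and string edges both count.
        if re.search(r"\b" + re.escape(kw) + r"\b", text):
            hits.append(kw)
    return hits

print(keyword_hits("Ten qualified bids came in for this BI dashboard (SQL).",
                   ["bi", "sql", "dashboard"]))
# ['bi', 'sql', 'dashboard']
```

`re.escape` keeps multi-word phrases like "power bi" working, and the boundary check means the space-padding trick is no longer needed.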
Deduplication is essential for daily scrapers. Without it, every day's run would add the same listings to the CSV. The seen_ids set prevents this — a listing scraped Monday will not appear again on Tuesday if the position is still open.
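The composite key is deliberate: a numeric job ID is only guaranteed unique within one board, so the script keys on board plus ID. A minimal illustration of the idea:

```python
# Composite board+id keys: the same numeric ID on two different boards
# is two distinct listings, but a repeat on the same board is skipped.
seen: set[str] = set()
rows = [
    ("FreelanceHub", "87234"),
    ("AnalyticsGigs", "87234"),  # same ID, different board: kept
    ("FreelanceHub", "87234"),   # true duplicate: skipped
]
kept = []
for board, job_id in rows:
    key = f"{board}_{job_id}"
    if key not in seen:
        seen.add(key)
        kept.append(key)
print(kept)  # ['FreelanceHub_87234', 'AnalyticsGigs_87234']
```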
Configurations change; code should not. When a job board updates its HTML structure (which happens regularly), Maya updates only the CSS selectors in the JOB_BOARDS configuration list, not the extraction logic. Separating configuration from code makes maintenance straightforward; adding a third board is a new dict, not new functions.
Run logging for accountability. The monitor_run_log.csv file records every run — how many listings were found, how many were relevant, how many were new. If the script starts returning zero listings consistently, that log tells Maya something might be wrong with the page structure.
Scheduling is outside Python. The script itself is just logic. Running it daily is a system task: launchd or cron on macOS, cron on Linux, Task Scheduler on Windows, or a cloud scheduler for something that needs to run on a server. Python's job is to do the work well; the scheduler's job is to invoke it at the right time.
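For example, a cron entry for a weekday 7:00 AM run might look like the following (the interpreter and file paths are illustrative; macOS users who prefer launchd would write a small plist instead):

```shell
# crontab -e: run the monitor at 7:00 AM every weekday,
# appending stdout/stderr to a log file for troubleshooting.
0 7 * * 1-5 /usr/bin/python3 /Users/maya/scripts/job_board_monitor.py >> /Users/maya/output/monitor.log 2>&1
```

Redirecting output to a log file matters for any unattended job: when the script fails silently at 7:00 AM, that log is the only record of why.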