Week 6 & Week 7 - Scraping & Structured Extraction

👋 Introduction

After completing my Foundation Phase, I’ve stepped into the Core Workflow Phase, where the focus shifts from setup to flow. These two weeks were all about understanding how data moves, how tools connect, and how logic becomes pipelines. I moved from raw access to structured extraction, building scraping systems that aim to be both ethical and scalable.

🧠 Week 6: From Curiosity to Craft

This week laid the groundwork for scraping, starting with the “why” and moving into the “how.”

🔹 Scraping Ethics & Strategy

I explored the ethics of scraping and how to design flows that respect site structure and access policies.

Topics Explored:

Static vs dynamic sites
Choosing between requests, Scrapy, and Selenium
Mimicking human behavior with headers, delays, retries
Scraping as negotiation: availability vs permission
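
The permission side of that negotiation can be checked in code before any request goes out. Here is a minimal sketch using the standard library's `urllib.robotparser`. The robots.txt content, the `my-learning-bot` user agent, and the `example.com` URLs are made up for illustration; a real run would load the target site's actual robots.txt instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "my-learning-bot"  # assumed name, for illustration only


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set that can be queried."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_delay(parser: RobotFileParser, default: float = 1.0) -> float:
    """Use the site's Crawl-delay if it declares one, else a default pause."""
    delay = parser.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default


rules = load_rules(SAMPLE_ROBOTS_TXT)
print(rules.can_fetch(USER_AGENT, "https://example.com/movies"))      # True
print(rules.can_fetch(USER_AGENT, "https://example.com/private/x"))   # False
print(polite_delay(rules))                                            # 2.0
```

robots.txt only expresses the site's stated crawling preferences. Terms of service and rate limits still need to be read separately.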

🔹 Python Requests: Lightweight & Powerful

Requests became my gateway to APIs and HTML pages. I focused on modularizing logic for reuse.

Topics Practiced:

GET/POST requests with headers and query parameters
Pagination, timeouts, session persistence
Status codes and response handling
Reusable request functions across scripts
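
One way the pieces above might fit together is sketched below: a shared session with headers and retries, plus a small pagination helper. The `page` query parameter, the JSON-list response shape, and the `my-learning-bot/0.1` user agent are assumptions for illustration, not details of any particular API.

```python
import time

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session(user_agent: str = "my-learning-bot/0.1") -> requests.Session:
    """Create a session with shared headers and automatic retries."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    # Retry up to 3 times on throttling and server errors, with backoff.
    retry = Retry(total=3, backoff_factor=0.5,
                  status_forcelist=(429, 500, 502, 503, 504))
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session


def fetch_pages(session, url, max_pages=3, delay=1.0, timeout=10):
    """Yield the JSON body of each page, stopping at the first empty one."""
    for page in range(1, max_pages + 1):
        response = session.get(url, params={"page": page}, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx responses into exceptions
        data = response.json()
        if not data:
            break
        yield data
        time.sleep(delay)  # pause between pages to avoid hammering the site
```

A script would then call `build_session()` once and loop over `fetch_pages(session, "https://example.com/api/movies")` (a placeholder URL), so cookies and connection reuse carry across requests.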

🧩 What Shifted

I started thinking in flows:
requests → raw HTML → (next: parsing) → cleaned data → reusable modules

🧠 Week 7: From HTML to Structured Data

This week was all about parsing, spidering, and exporting: turning raw HTML into usable datasets.

🔹 BeautifulSoup: Parsing with Precision

I learned to navigate HTML like a map, identifying tags, classes, and IDs to extract structured data.

Topics Practiced:

.find(), .find_all(), .attrs
.text, .strip(), regex cleanup
Handling missing elements and nested tags
Combining with requests for full flow
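
A small sketch of those parsing steps on an inline HTML snippet follows. The markup, class names, and the `parse_movies` helper are invented for illustration; in a full flow the HTML string would come from a `requests` response's `.text`.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup: the second card has no year element.
SAMPLE_HTML = """
<div class="movie"><h2><a href="/m/1"> Alpha </a></h2><span class="year">(1999)</span></div>
<div class="movie"><h2><a href="/m/2">Beta</a></h2></div>
"""


def parse_movies(html: str) -> list[dict]:
    """Extract title, URL, and year from each movie card."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.find_all("div", class_="movie"):
        link = card.find("a")
        year_tag = card.find("span", class_="year")  # None when missing
        # Regex cleanup: pull the four digits out of text like "(1999)".
        year_match = re.search(r"\d{4}", year_tag.text) if year_tag else None
        rows.append({
            "title": link.text.strip() if link else None,
            "url": link.attrs.get("href") if link else None,
            "year": int(year_match.group()) if year_match else None,
        })
    return rows


print(parse_movies(SAMPLE_HTML))
# [{'title': 'Alpha', 'url': '/m/1', 'year': 1999},
#  {'title': 'Beta', 'url': '/m/2', 'year': None}]
```

Checking each lookup for `None` before touching `.text` is what keeps a single incomplete card from crashing the whole run.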

🔹 Scrapy: Scalable Crawling

Scrapy introduced a new mindset: modular, maintainable scraping with spider logic and pagination.

Topics Explored:

scrapy startproject, spider scaffolding
response.css() and response.xpath()
Pagination with response.follow()
Exporting to CSV/JSON
Navigating GitHub documentation and source code

🔹 CSS Selectors: Targeting with Intent

I refined my selector logic to extract movie names, URLs, and nested elements with precision.

Topics Practiced:

Combined selectors for clean output
Selector testing in Scrapy Shell
Adapting logic across different site layouts

📂 Daily Reflections (Day 24 to Day 31)
You can view my documented progress here:
🔗 GitHub: Week 6 Reflection
🔗 GitHub: Week 7 Reflection
