Week 6 & Week 7 - Scraping & Structured Extraction

A hands-on recap of two weeks focused on accessing, parsing, and exporting data through modular scraping workflows.


Hey, I'm Ramya 👋 I write to learn, and I learn by building. This space is my digital notebook, where curiosity meets clarity and every post reflects a milestone in my journey. I'm a final-year B.Tech student in Artificial Intelligence & Data Science at GMR Institute of Technology. I recently completed an internship at Tao Digital, where I worked on AWS cloud services and contributed to a Smart Fridge Annotation Project using YOLOv11. Learning Out Loud is my blog, a place where I document what I learn, build, and reflect on. It's organized into evolving series like: 📚 Foundation Phase Series: week-by-week insights from my early cloud and data engineering journey. I believe in thoughtful growth, clean documentation, and expressive storytelling. Whether it's building ETL pipelines, annotating datasets, or writing about yoga and balance, I'm here to share what matters.

👋 Introduction

After completing my Foundation Phase, I've stepped into the Core Workflow Phase, where the focus shifts from setup to flow. These two weeks were all about understanding how data moves, how tools connect, and how logic becomes pipelines. I transitioned from raw access to structured extraction, building scraping workflows designed to be both ethical and scalable.


🧠 Week 6: From Curiosity to Craft

This week laid the groundwork for scraping, starting with the "why" and moving into the "how."

🔹 Scraping Ethics & Strategy

I explored the ethics of scraping and how to design flows that respect site structure and access policies.

Topics Explored:

  • Static vs dynamic sites

  • Choosing between requests, Scrapy, and Selenium

  • Mimicking human behavior with headers, delays, retries

  • Scraping as negotiation: availability vs permission
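A minimal sketch of what "permission" and "delays, retries" could look like in code, using only the standard library. The bot name and the sample robots.txt below are made up for illustration; a real crawler would download the site's own robots.txt first.

```python
import random
from urllib.robotparser import RobotFileParser

USER_AGENT = "learning-out-loud-bot"  # hypothetical bot name


def build_robot_rules(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file into a rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Return a random wait between half of and the full exponential ceiling.

    The ceiling doubles with each attempt (base * 2**attempt) up to `cap`,
    and the randomness keeps retries from landing at a fixed rhythm.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(ceiling / 2, ceiling)


# Illustrative robots.txt content, not taken from any real site.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

rules = build_robot_rules(SAMPLE_ROBOTS)
print(rules.can_fetch(USER_AGENT, "https://example.com/movies"))     # True
print(rules.can_fetch(USER_AGENT, "https://example.com/private/x"))  # False
```

Checking `can_fetch` before every request and sleeping for `backoff_delay(attempt)` between retries is one simple way to keep a scraper on the "permission" side of that negotiation.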

🔹 Python Requests: Lightweight & Powerful

Requests became my gateway to APIs and HTML pages. I focused on modularizing logic for reuse.

Topics Practiced:

  • GET/POST requests with headers and query parameters

  • Pagination, timeouts, session persistence

  • Status codes and response handling

  • Reusable request functions across scripts
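The topics above can be combined into a small reusable module. This is a sketch under assumptions rather than the exact code from these weeks: the `User-Agent` value, the function names, and the idea that the API paginates with a `page` query parameter and returns an empty payload when it runs out are all hypothetical.

```python
import time
from typing import Any, Iterator, Optional

import requests

DEFAULT_HEADERS = {"User-Agent": "learning-out-loud-bot/0.1"}  # hypothetical value


def make_session() -> requests.Session:
    """Create a session so headers, cookies, and connections persist across calls."""
    session = requests.Session()
    session.headers.update(DEFAULT_HEADERS)
    return session


def fetch_json(
    session: requests.Session,
    url: str,
    params: Optional[dict] = None,
    timeout: float = 10.0,
) -> Any:
    """GET a URL and return its decoded JSON, raising on 4xx/5xx status codes."""
    response = session.get(url, params=params, timeout=timeout)
    response.raise_for_status()
    return response.json()


def iter_pages(
    session: requests.Session,
    url: str,
    max_pages: int = 5,
    delay: float = 1.0,
) -> Iterator[Any]:
    """Yield one payload per page, stopping at the first empty page.

    Assumes the (hypothetical) API accepts a `page` query parameter.
    """
    for page in range(1, max_pages + 1):
        payload = fetch_json(session, url, params={"page": page})
        if not payload:
            break
        yield payload
        time.sleep(delay)  # pause before requesting the next page
```

Any script can then call `iter_pages(make_session(), url)` and loop over the results, which keeps timeouts and status-code handling in one place.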

🧩 What Shifted

I started thinking in flows:
requests → raw HTML → (next: parsing) → cleaned data → reusable modules


🧠 Week 7: From HTML to Structured Data

This week was all about parsing, spidering, and exporting: turning raw HTML into usable datasets.

🔹 BeautifulSoup: Parsing with Precision

I learned to navigate HTML like a map, identifying tags, classes, and IDs to extract structured data.

Topics Practiced:

  • .find(), .find_all(), .attrs

  • .text, .strip(), regex cleanup

  • Handling missing elements and nested tags

  • Combining with requests for full flow
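Here is a small sketch of these BeautifulSoup calls working together. The HTML snippet, class names, and field names are invented for illustration; in a full flow, the `html` string would come from a `requests` response's `.text`.

```python
import re

from bs4 import BeautifulSoup

# Made-up markup: the second movie deliberately has no year element.
SAMPLE_HTML = """
<ul id="movies">
  <li class="movie"><a href="/m/1">  Movie   One </a><span class="year">(1999)</span></li>
  <li class="movie"><a href="/m/2">Movie Two</a></li>
</ul>
"""


def clean_text(raw: str) -> str:
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", raw).strip()


def parse_movies(html: str) -> list[dict]:
    """Extract title, URL, and year from each movie item, tolerating gaps."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.find_all("li", class_="movie"):
        link = item.find("a")                        # None if the tag is missing
        year_tag = item.find("span", class_="year")  # None if the tag is missing
        year_match = re.search(r"\d{4}", year_tag.text) if year_tag else None
        rows.append({
            "title": clean_text(link.text) if link else None,
            "url": link.attrs.get("href") if link else None,
            "year": int(year_match.group()) if year_match else None,
        })
    return rows


print(parse_movies(SAMPLE_HTML))
```

Checking for `None` before touching `.text` or `.attrs` is what keeps one missing element from crashing the whole run.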

🔹 Scrapy: Scalable Crawling

Scrapy introduced a new mindset: modular, maintainable scraping with spider logic and pagination.

Topics Explored:

  • scrapy startproject, spider scaffolding

  • response.css() and response.xpath()

  • Pagination with response.follow()

  • Exporting to CSV/JSON

  • Navigating GitHub documentation and source code

🔹 CSS Selectors: Targeting with Intent

I refined my selector logic to extract movie names, URLs, and nested elements with precision.

Topics Practiced:

  • Combined selectors for clean output

  • Selector testing in Scrapy Shell

  • Adapting logic across different site layouts


📂 Daily Reflections (Day 24 to Day 31)
You can view my documented progress here:
🔗 GitHub: Week6 Reflection
🔗 GitHub: Week7 Reflection