Back in the old days when internet connections weren’t always on and were billed by the hour, surfing the web was a race against time. I used to spend my days using a website update checker (WWWC) to quickly scan only the updated pages of my favorite sites.
This was before the iPhone 3G went on sale domestically in Japan, and before the term ‘smartphone’ was even in common use. Around that time, I spent about half a year with a Windows CE device that had a built-in PHS. It was like a prototype of today’s smartphones, and I used to program in scripting languages on that tiny screen.
That gadget could connect to the internet via PHS, and its connection speed of 64 kbps (or maybe 128 kbps) was fast for its time, but since it was a dial-up connection, I couldn’t enjoy the internet around the clock.
During that half-year, I had no income and was trying to keep expenses down. I had plenty of time but was in poor health, so the internet was my main connection to the world.
To minimize my internet usage, I used a scripting language (I think it was Ruby) to download entire websites to my local storage, allowing me to read them offline.
Windows CE was essentially a cut-down subset of desktop Windows, so while it looked like Windows, it couldn’t run the same software. If I needed a tool, I had to write it myself. Also, most services back then weren’t designed with smartphones in mind (it was the era of Yahoo!, not Google).
Technically, there were libraries for HTTP communication, but there was no HTML parsing library. So I used regular expressions to parse the HTML, extract the links on each page, and follow them repeatedly until every page within the same site had been downloaded.
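The original was a Ruby script on Windows CE, which I no longer have, but the general idea looked roughly like this minimal Python sketch (the starting URL is a placeholder, and a real HTML parser would handle many cases the regex misses):

```python
import os
import re
import urllib.request
from urllib.parse import urljoin, urlparse

START_URL = "http://example.com/"   # placeholder starting page
OUT_DIR = "mirror"                  # local copy goes here

def local_path(url):
    """Map a URL to a file path inside OUT_DIR."""
    path = urlparse(url).path.lstrip("/") or "index.html"
    if path.endswith("/"):
        path += "index.html"
    return os.path.join(OUT_DIR, path)

def crawl(start_url, limit=200):
    """Download every page reachable within the same site,
    extracting links with a crude regular expression."""
    site = urlparse(start_url).netloc
    queue, seen = [start_url], set()
    while queue and len(seen) < limit:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            with urllib.request.urlopen(url) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue
        # Save the page under a filename derived from its path.
        dest = local_path(url)
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        with open(dest, "w", encoding="utf-8") as f:
            f.write(html)
        # Regex-based link extraction, then follow only same-site links.
        for href in re.findall(r'href="([^"]+)"', html, flags=re.IGNORECASE):
            absolute = urljoin(url, href).split("#")[0]
            if urlparse(absolute).netloc == site:
                queue.append(absolute)

if __name__ == "__main__":
    crawl(START_URL)
```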
Looking back, programming to parse HTML and extract links was essentially web scraping, and although the purpose was different, I was accumulating data like a crawler. In a way, it was like a personal, self-hosted Wayback Machine.
Nowadays, parsing HTML alone isn’t enough to reproduce a website locally, since so much of a page is rendered by JavaScript. The mainstream approach is to drive a web browser programmatically, manipulating it as an object.
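As one example of that approach (the article doesn’t name a specific tool), here is a minimal sketch using Playwright, which requires `pip install playwright` and `playwright install chromium`:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/")          # placeholder URL
    page.wait_for_load_state("networkidle")    # let JavaScript finish rendering
    html = page.content()                      # the DOM after scripts ran,
                                               # not just the raw HTML response
    browser.close()

print(html[:500])
```

The point is that you get the page as the browser sees it after scripts have run, which a plain HTTP download can’t give you.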
Also, if you want data, it’s better to use officially provided APIs instead of scraping. And if you want to see past web pages, you can save individual pages with Pocket or look up old snapshots on the Internet Archive.
All things considered, there are fewer occasions for individuals to resort to web scraping. Whenever a site’s structure changes, the program has to be modified, which makes maintenance a chore, so the benefits of doing it yourself seem to have diminished.
I wrote this article after my homemade update checker threw an error, which reminded me of the old days. The checker is written in Python; it watches a few sites without RSS feeds and a certain bulletin board for specific posts, runs once a day on an always-on server PC, and emails me if there are any updates. I could check more frequently and send notifications over social media, but since the update frequency isn’t that high, I prefer a leisurely pace.
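The checker itself isn’t shown here, but the general approach is simple enough to sketch: hash each page, compare it with the previous run, and send an email when something changed. The URLs and mail settings below are placeholders; scheduling once a day is left to cron or Task Scheduler.

```python
import hashlib
import json
import smtplib
import urllib.request
from email.message import EmailMessage

# Placeholder values; the real checker's targets and mail settings differ.
PAGES = ["https://example.com/news.html"]
STATE_FILE = "hashes.json"
MAIL_FROM = "checker@example.com"
MAIL_TO = "me@example.com"
SMTP_HOST = "localhost"

def fetch_hash(url):
    """Hash the raw page body so any change is detected."""
    with urllib.request.urlopen(url) as resp:
        return hashlib.sha256(resp.read()).hexdigest()

def notify(updated):
    """Send a plain-text email listing the updated URLs."""
    msg = EmailMessage()
    msg["Subject"] = "Site updates"
    msg["From"] = MAIL_FROM
    msg["To"] = MAIL_TO
    msg.set_content("\n".join(updated))
    with smtplib.SMTP(SMTP_HOST) as smtp:
        smtp.send_message(msg)

def main():
    # Load the hashes recorded by the previous run, if any.
    try:
        with open(STATE_FILE) as f:
            old = json.load(f)
    except FileNotFoundError:
        old = {}
    new, updated = {}, []
    for url in PAGES:
        new[url] = fetch_hash(url)
        if old.get(url) and old[url] != new[url]:
            updated.append(url)
    # Persist the current hashes for the next run.
    with open(STATE_FILE, "w") as f:
        json.dump(new, f)
    if updated:
        notify(updated)

if __name__ == "__main__":
    main()
```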