I was browsing around the Users list on CD and noticed an oddity: two accounts (@asdfzxcv and @juchong) seem to have absolutely huge numbers of topics viewed - orders of magnitude beyond every other user, and probably close to the total number of topics ever posted on CD. This is despite having very few posts read and days visited.
I have two guesses as to what’s happening here: either there’s a bug on the server end, or they are scraping CD (if so, what for?). In either case, I think this merits further investigation.
Nothing wrong with scraping tbh, and it's not that difficult to do. It's useful for saving the site in case anything happens to it. See re: IFI Hosting. If anyone tries to ban a scraper, it'll usually adapt. The best solution, if it really matters, is rate limiting, offering an official API, or both.
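For what it's worth, client-side "responsible" scraping is mostly just a loop with a sleep in it. A rough Python sketch - the URL pattern, delay, and output directory here are placeholders I made up, not anything CD-specific:

```python
import os
import time

import requests

BASE = "https://www.chiefdelphi.com"  # target site
DELAY_S = 2.0                         # arbitrary "polite" delay between requests

os.makedirs("topics", exist_ok=True)

def fetch_topic(topic_id: int) -> str | None:
    """Fetch one topic page; return its HTML, or None if it isn't there."""
    url = f"{BASE}/t/{topic_id}"      # placeholder URL pattern
    resp = requests.get(
        url,
        headers={"User-Agent": "cd-archiver (contact: you@example.com)"},
        timeout=30,
    )
    return resp.text if resp.status_code == 200 else None

for topic_id in range(1, 1001):
    html = fetch_topic(topic_id)
    if html is not None:
        with open(f"topics/{topic_id}.html", "w", encoding="utf-8") as f:
            f.write(html)
    time.sleep(DELAY_S)               # the entire "responsible pace" part
```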
TL;DR Web scrapers are the internet's independent archivists.
Poorly for whom? It's a legitimate method to say “GIVE ME ALL YOUR DATA NOW!!” rather than “Give me your data at a responsible pace, please” - provided the application is threaded and uses multiple proxies.
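To make the contrast concrete, the "all your data now" version is roughly a thread pool plus a proxy rotation. A sketch with made-up proxy addresses and a placeholder URL pattern:

```python
from concurrent.futures import ThreadPoolExecutor

import requests

# Made-up proxy pool; a real scraper would pull these from a proxy provider.
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]

def grab(topic_id: int) -> tuple[int, int]:
    """Fetch one topic through a proxy chosen by round-robin."""
    proxy = PROXIES[topic_id % len(PROXIES)]
    resp = requests.get(
        f"https://www.chiefdelphi.com/t/{topic_id}",  # placeholder URL pattern
        proxies={"http": proxy, "https": proxy},
        timeout=30,
    )
    return topic_id, resp.status_code

# 32 workers, zero delay: the requests arrive as a burst spread across
# several source addresses instead of a slow trickle from one.
with ThreadPoolExecutor(max_workers=32) as pool:
    for topic_id, status in pool.map(grab, range(1, 501)):
        print(topic_id, status)
```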
Sometimes (and if it's all coming from one host, it's a DoS), but most websites should be resistant to this with simple load balancing, or even by using services like Cloudflare which offer “DDoS protection”. It really just depends on the server infrastructure. Depending on the load, a sufficiently powerful server in terms of RAM and processor cores/speed can be enough too.
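And the application-level version of the rate limiting mentioned upthread isn't much code either. A rough per-IP token bucket sketch with arbitrary limits I picked for illustration:

```python
import time
from collections import defaultdict

# Made-up limits: each IP gets a bucket of 10 requests that refills at
# 1 request per second.
CAPACITY = 10
REFILL_RATE = 1.0

_buckets: dict[str, tuple[float, float]] = defaultdict(
    lambda: (float(CAPACITY), time.monotonic())
)

def allow(ip: str) -> bool:
    """Return True if this IP is allowed another request right now."""
    tokens, last = _buckets[ip]
    now = time.monotonic()
    tokens = min(CAPACITY, tokens + (now - last) * REFILL_RATE)
    if tokens < 1.0:
        _buckets[ip] = (tokens, now)
        return False
    _buckets[ip] = (tokens - 1.0, now)
    return True

# In a web app you'd call allow(client_ip) before serving the request and
# answer with HTTP 429 when it returns False.
```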
Ahhh! I’ve been noticed! Yes. I’m scraping CD in hopes of archiving things should the site “disappear” one day. In light of recent events, I feel it’s prudent to have a large part of FIRST’s history archived. Also, I have a massive NAS sitting in a data center and it’s relatively convenient to kick off a script and walk away.
I have a complete* archive up to a point in 2017, and I'm running a crawl right now; if anyone's interested in coordinating, let me know. I want to avoid completely redundant crawls if possible.
1st rule of the internet: everything is forever. 2nd rule of the internet: there's so much content created every day that forever is a short amount of time.
I feel bad for the scraper when it hits my old scouting databases. Before I used GitHub, I published a massive file every week. Brandon even increased the max file size limit at my request lol. I think I calculated it out once, and my Excel workbooks combined accounted for something on the order of 10% of the total data on CD.