Unusual post reading quantity on CD

I was browsing around the Users list on CD and noticed an oddity: two accounts (@asdfzxcv and @juchong) seem to have absolutely huge numbers of topics viewed - orders of magnitude beyond every other user, and probably close to the total number of topics ever posted on CD. This is in spite of having very low counts of posts read and days visited.

I have two guesses as to what’s happening here: either there’s a bug on the server end, or they are scraping CD (if so, what for?). In either case, I think this merits further investigation.

8 Likes

Juan’s just getting some light reading in before bed.

48 Likes

probably scraping

4 Likes

Juan’s a weird dude - he’s like, one of the few people I know who have weirder sleep schedules than I do.

3 Likes

Nothing wrong with scraping tbh, and it’s not that difficult to do. It’s useful for saving the site in case anything happens to it. See re:IFI Hosting. If anyone tries to ban a scraper, the scraper will usually just adapt. The best solution, if it really matters, is rate limiting, offering an official API, or both.
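
For the “responsible pace” option, a rough Python sketch of a polite crawl looks something like this - the topic-ID range and the 2-second delay are placeholders, and it assumes CD’s Discourse instance exposes the standard `/t/<id>.json` endpoint:

```python
import time
import requests  # third-party: pip install requests

BASE = "https://www.chiefdelphi.com"   # Discourse exposes each topic as JSON
DELAY = 2.0                            # seconds between requests - the "responsible pace"

def fetch_topic(topic_id):
    """Grab one topic via the public /t/<id>.json endpoint."""
    resp = requests.get(f"{BASE}/t/{topic_id}.json", timeout=30)
    resp.raise_for_status()
    return resp.json()

for topic_id in range(1, 1000):        # placeholder range of topic IDs
    try:
        topic = fetch_topic(topic_id)
        # ...write the topic to disk here...
    except requests.HTTPError:
        pass                           # deleted/private topics just 404
    time.sleep(DELAY)                  # throttle so the server barely notices
```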

TL;DR Web scrapers are the internet’s independent archivists.

5 Likes

Lmao, can we deal with this later? I think this website already has enough going on as is. (/s)

2 Likes

It’s not difficult to do poorly.

1 Like

Poorly for whom? :joy: It’s a legitimate method to say “GIVE ME ALL YOUR DATA NOW!!” rather than “Give me your data at a responsible pace, please” - provided the application is threaded and uses multiple proxies.
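
To be concrete about what “threaded and uses multiple proxies” means, a hypothetical greedy version looks roughly like this (the proxy addresses and target URLs are placeholders, not anything real):

```python
import random
import concurrent.futures
import requests  # third-party: pip install requests

# Hypothetical proxy pool - these addresses are placeholders
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]

def grab(url):
    """Fetch one page through a randomly chosen proxy, as fast as possible."""
    proxy = random.choice(PROXIES)
    resp = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)
    return resp.text

urls = [f"https://example.com/t/{i}.json" for i in range(1, 5000)]  # placeholder targets

# 50 workers firing simultaneously - the "GIVE ME ALL YOUR DATA NOW" version
with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
    pages = list(pool.map(grab, urls))
```

There’s no throttle anywhere in that loop, which is the whole point - and the whole problem.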

1 Like

…that’s just another form of DDoSing though? Enough requests at once and you cripple the server (again, see the image above).

4 Likes

Sometimes (and if it’s coming from one host, it’s a DoS rather than a DDoS), but most websites should be resistant to this with simple load balancing, or by using services like Cloudflare that offer “DDoS protection”. It really just depends on the server infrastructure. Depending on the load, a sufficiently powerful server in terms of RAM and processor cores/speed can handle it on its own too.
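
The per-client throttling that those protection services (or a reverse proxy in front of the forum) typically apply is something like a token bucket per IP. A rough sketch - the rate and burst numbers are arbitrary, not anything CD actually uses:

```python
import time
from collections import defaultdict

RATE = 1.0    # tokens refilled per second (arbitrary)
BURST = 10.0  # bucket size, i.e. the allowed burst (arbitrary)

# per-client state: (tokens remaining, timestamp of last update)
buckets = defaultdict(lambda: (BURST, time.monotonic()))

def allow_request(client_ip):
    """Return True if this client still has a token, False if it should get a 429."""
    tokens, last = buckets[client_ip]
    now = time.monotonic()
    tokens = min(BURST, tokens + (now - last) * RATE)  # refill since last request
    if tokens >= 1.0:
        buckets[client_ip] = (tokens - 1.0, now)
        return True
    buckets[client_ip] = (tokens, now)
    return False
```

A scraper that sleeps between requests never drains its bucket; the 50-thread version above burns through its burst allowance almost immediately and starts eating 429s.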

Ahhh! I’ve been noticed! Yes. I’m scraping CD in hopes of archiving things should the site “disappear” one day. In light of recent events, I feel it’s prudent to have a large part of FIRST’s history archived. Also, I have a massive NAS sitting in a data center and it’s relatively convenient to kick off a script and walk away.

80 Likes

I have a complete* archive up to a point in 2017, and I’m running a crawl right now; if anyone’s interested in coordinating, let me know. I want to avoid completely redundant crawls if possible.

4 Likes

How to make a backup

3 crawls

Across

3 people

Where

3 are off-site

Or something

9 Likes

Multiple offsite secure backups of all the stupid cringe stuff I posted as an idiot teenager talking outa my rearend about robotics.

Great!

33 Likes

1st rule of the internet: everything is forever. 2nd rule of the internet: there’s so much content created every day that forever is a short amount of time.

6 Likes

I feel bad for the scraper when it hits my old scouting databases. Before I used GitHub, I published a massive file every week. Brandon even increased the max file size limit at my request lol. I think I calculated it out once, and my Excel books combined accounted for something on the order of 10% of the total data on CD.

33 Likes

At least that’s useful content. Getting all my posts and those from @notmattlythgoe is like getting a 0-byte file that somehow takes up a couple gigs.

9 Likes

Can you send me your script?

I understand some of the things happening in this post :sweat_smile:

1 Like

That makes one of us.

1 Like