It uses @bobbysq’s method, but then continues on until it finds the next rule or other paragraph break. That allows it to capture (most of) the blue boxes and lists after rules, but it introduces other problems when the rule is directly followed by a text paragraph without a heading. Some of the rules are also missing their hyperlinks, which makes them invisible to the scraper. Basically, the HTML is a mess and not given to easy scraping.
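For anyone who wants to play with the idea, here is a rough sketch of the "start at the rule's anchor, keep going until the next rule or a heading" approach. It is not the actual scraper: it assumes each rule begins in a block containing `<a name="R15">`, that blue boxes are plain `<div>`s, and that block tags are properly closed, none of which the real manual guarantees. `BlockCollector`, `extract_rule`, and the sample HTML are all made up for illustration.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "ul", "ol", "div", "h1", "h2", "h3", "h4"}
HEADING_TAGS = {"h1", "h2", "h3", "h4"}


class BlockCollector(HTMLParser):
    """Flatten a page into (tag, anchor, text) tuples, one per top-level block."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._depth = 0  # nesting depth inside the current top-level block
        self._tag = None
        self._anchor = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            if self._depth == 0:
                self._tag, self._anchor, self._text = tag, None, []
            self._depth += 1
        elif tag == "a" and self._depth > 0 and self._anchor is None:
            # Assumption: a rule is marked by <a name="R15"> in its first block.
            self._anchor = dict(attrs).get("name")
        elif tag == "li":
            self._text.append(" ")  # keep list items from running together

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                text = " ".join("".join(self._text).split())
                self.blocks.append((self._tag, self._anchor, text))

    def handle_data(self, data):
        if self._depth > 0:
            self._text.append(data)


def extract_rule(html, rule_id):
    """Return the text blocks for rule_id, stopping at the next rule or heading."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    collected = []
    for tag, anchor, text in parser.blocks:
        if not collected:
            if anchor == rule_id:
                collected.append(text)
        elif anchor is not None or tag in HEADING_TAGS:
            break  # next rule or a heading ends the current rule
        else:
            collected.append(text)
    return collected


# Hypothetical page fragment, not copied from the real manual.
SAMPLE = """
<h2>Robot Rules</h2>
<p><a name="R15"></a>R15. Example rule text.</p>
<div class="bluebox">Blue box note.</div>
<ul><li>first item</li><li>second item</li></ul>
<p><a name="R16"></a>R16. Next rule.</p>
<p>Unrelated paragraph.</p>
<h2>Game Rules</h2>
"""

print(extract_rule(SAMPLE, "R15"))
print(extract_rule(SAMPLE, "R16"))
```

The second call shows the problem described above: "Unrelated paragraph." gets swallowed into R16 because nothing separates it from the rule. A rule with no anchor would not be found at all.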
I’m starting to warm to @UnofficialForth’s idea to just take a bunch of screenshots and manage a database of them. It would be a bit of work right after kickoff, but once that’s done, they don’t usually update many rules each week, so it should be feasible.
The more I think about it, I’m not a huge fan of the image database.
Sure, it works for checking a specific rule, such as R15. However, who has the rule number R15 memorized (and not the rule itself)? What if I’m searching for a specific rule? Can I find it easily?
Looking at what the end user needs, I don’t think my initial idea of an image database works. It’s too simplified and too limited in scope for what teams would want to use it for. If we want to make something that truly benefits the community, we would have to take it past that. (Images such as diagrams might still be nice.)
True, but at the same time, the link above went straight to R15 in one click. That’s come in very handy in conversations compared to “open this PDF, go to page X, find ‘R15’ on it”. At least we have options for both. I definitely wouldn’t want to get rid of the PDF any time soon, because in its current form the HTML version is not a proper replacement.
I implemented this into a Discord bot that I never really finished. Anyone willing to implement my code into a bot or program of their own is more than welcome to, just please credit me in a comment or something.