Rules Semantic Search

Hello again, CD! I made a topic about a site I made for searching the FRC rules recently, but I made a major update that I thought deserved its own post as I’m going into some technical detail.

You can play with the semantic search by going to https://rules-search.pages.dev/ and toggling on “Use Semantic Search”. Try asking it queries like “can i cross the line before teleop” and “what gauge wire should i use for a 20 amp connection”.

After @_aq on the FRC Discord suggested adding semantic search, I looked into the various ways to implement it. I was initially drawn towards Elasticsearch because of the “enterprise” feel and the fact that it has a pre-trained semantic search model and integration. Then I realized, in fashion with typical “enterprise” software, the ML features only come with a $95 per month subscription. After some googling, I found https://haystack.deepset.ai/, but I had trouble installing it and I decided to go with sentence-transformers.

Sentence-transformers has documentation on how to implement semantic search, so I won’t go over that. After playing with the search a little, I realized the off-the-shelf models did not have context for a lot of the specific vocabulary used in the game manual, so I decided to tune the model for the FRC domain. To do this, I used Generative Pseudo Labeling (GPL). I created a corpus based on the 2021, 2022, and 2023 rulebooks, for a total of 497 documents. The GPL script takes a corpus, generates hundreds of questions that are answered by the rules, and finds passages that do not answer those questions. It then trains a text embedding model on these correct and incorrect pairings. I ended up training it on an A100 (40 GB) GPU on Google Colab. The other options on Colab did not have enough VRAM to train the model, so I ended up paying $20 for 200 credits, 40 of which remained.
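For anyone curious what "training on correct and incorrect pairings" looks like, here is a rough sketch of the MarginMSE objective that GPL is built around, as far as I understand it. This is not the GPL code itself: the function names and the toy two-dimensional vectors are made up for illustration. The idea is that a stronger "teacher" model scores how much better the correct rule answers a question than the incorrect one, and the embedding model is nudged until its own score gap matches.

```python
def dot(a, b):
    """Dot-product similarity between two embedding vectors."""
    return sum(x * y for x, y in zip(a, b))


def margin_mse(query, positive, negative, teacher_margin):
    """Squared error between the student's score gap and the teacher's.

    query, positive, negative: embeddings from the model being trained.
    teacher_margin: the teacher's score(query, positive) - score(query, negative).
    """
    student_margin = dot(query, positive) - dot(query, negative)
    return (student_margin - teacher_margin) ** 2


# Toy vectors (hypothetical); real embeddings have hundreds of dimensions.
loss = margin_mse([1.0, 0.0], [0.5, 0.0], [0.0, 1.0], teacher_margin=1.0)
print(loss)
```

In real training this value is averaged over a batch and minimized with gradient descent; the sketch only shows what is being minimized for a single (question, correct rule, incorrect rule) triple.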

The search server runs Flask and returns the rule number (currently used as a unique ID) of each result. It currently returns the top 5 results.
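The ranking step behind that can be sketched in a few lines of plain Python. This is only an illustration under some assumptions: it uses cosine similarity (the real server may score differently), the rule IDs and two-dimensional vectors are invented, and in practice the vectors would come from the tuned embedding model rather than being typed in by hand.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_rules(query_vec, rule_vecs, k=5):
    """Return the IDs of the k rules most similar to the query, best first."""
    ranked = sorted(
        rule_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [rule_id for rule_id, _ in ranked[:k]]


# Hypothetical index: rule ID -> embedding (toy 2-D vectors).
toy_index = {
    "R1": [1.0, 0.0],
    "R2": [0.9, 0.1],
    "R3": [0.0, 1.0],
    "R4": [0.5, 0.5],
    "R5": [-1.0, 0.0],
    "R6": [0.7, 0.3],
}

print(top_rules([1.0, 0.0], toy_index))
```

Because only IDs are returned, the front end can look up the full rule text itself, which keeps the search response small.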

Creating the corpus

GPL expects a corpus in a BEIR .jsonl format. It's not very well documented, but I figured out that it expects a document where:
  • Each line is a JSON document
  • Each JSON document contains the keys
    • text: the document’s text
    • title: a title for the document. It does not have to be unique and can be blank
    • _id: a unique string id in any format
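
A minimal sketch of producing a file in that shape, using only the standard library. The rule IDs and text below are placeholders rather than real manual content, the helper names are made up, and the `corpus.jsonl` file name is what I believe GPL looks for, so treat that as an assumption.

```python
import json


def corpus_lines(rules):
    """Turn rule dicts into JSON lines with the _id, title, and text keys."""
    seen = set()
    lines = []
    for rule in rules:
        doc_id = str(rule["id"])
        if doc_id in seen:
            raise ValueError(f"duplicate _id: {doc_id}")
        seen.add(doc_id)
        doc = {"_id": doc_id, "title": rule.get("title", ""), "text": rule["text"]}
        # json.dumps escapes embedded newlines, so each document stays on one line.
        lines.append(json.dumps(doc, ensure_ascii=False))
    return lines


def write_corpus(rules, path):
    """Write one JSON document per line to the given path."""
    with open(path, "w", encoding="utf-8") as f:
        for line in corpus_lines(rules):
            f.write(line + "\n")


# Placeholder rules (hypothetical IDs and text).
example_rules = [
    {"id": "2023-X101", "text": "Placeholder text for the first rule."},
    {"id": "2023-X102", "title": "Example title", "text": "Placeholder text\nwith a line break."},
]

if __name__ == "__main__":
    write_corpus(example_rules, "corpus.jsonl")
```

Prefixing the ID with the season year is one way to keep IDs unique when the same rule number shows up in more than one year's manual.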

This is so amazing, I love it, for sure it will be very useful!!

This is great if you know the right question to ask. I asked, “How fast can the robot move?” and got rules about height, extensions and shooting game pieces.

I believe that’s because the manual doesn’t mention a maximum speed anywhere, so it’s just trying to find the next closest.

Can you incorporate data from Q&A into this?

Site appears to be down

Ah, it is. The Cloudflare portal is being buggy right now, so I’m not sure if the problem is on my side or if Cloudflare’s just down.

Edit:
It does seem to be a problem on my side so I’ve rolled back prod to an earlier commit.

Error says something about failure to render so it could totally just be the workers runner messing up cloudflare side

You’re most likely correct. But it’s the kind of question that comes up, either as a design consideration during build, or as a question about “high speed ramming” during the season.

It’s like one of those, “can’t prove a negative” things.

I wonder if you could train it to respond, “Subject not found.” Or is that too AI?

Wow this is pretty nifty! How long do you expect it to take to be up and running with the new manual after kickoff? (assuming that is the plan)

Probably a few minutes. I have a deploy GitHub Action so that once I see it on the stream I can have it built and deployed as fast as possible. If the HTML format is changed (unlikely, it’s exported from Word) or if they don’t have an HTML file uploaded yet, I’ve been working on a setup that parses the PDF. It won’t provide as much detail but should allow for an equivalent experience.
