Hello again, CD! I made a topic about a site I made for searching the FRC rules recently, but I made a major update that I thought deserved its own post as I’m going into some technical detail.
You can play with the semantic search by going to https://rules-search.pages.dev/ and toggling on Use Semantic Search. Try asking it queries like “can i cross the line before teleop” and “what gauge wire should i use for a 20 amp connection”
After @_aq on the FRC Discord suggested adding semantic search, I looked into the various ways to implement it. I was initially drawn to Elasticsearch because of its “enterprise” feel and the fact that it ships with a pre-trained semantic search model and integration. Then I realized that, in typical “enterprise” software fashion, the ML features only come with a $95-per-month subscription. After some googling I found https://haystack.deepset.ai/, but I had trouble installing it, so I went with sentence-transformers instead.
Sentence-transformers has documentation on how to implement semantic search, so I won’t go over that. After playing with the search a little, I realized the off-the-shelf models did not have context for a lot of the specific vocabulary used in the game manual, so I decided to fine-tune the model for the FRC domain. To do this, I used Generative Pseudo Labeling (GPL). I created a corpus based on the 2021, 2022, and 2023 rulebooks, for a total of 497 documents. The GPL script takes a corpus, generates hundreds of synthetic questions that the rules answer, and mines passages that do not answer those questions. It then trains a text embedding model on these positive and negative examples. I ended up training it on an A100 (40 GB) GPU on Google Colab. The other GPUs on Colab did not have enough VRAM to train the model, so I paid $20 for 200 credits, 40 of which remained.
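For anyone curious what that step looks like in code, here is a minimal sketch assuming the UKPLab gpl package (`pip install gpl`); the paths, base checkpoint, step count, and queries-per-passage value are placeholders rather than my exact settings:

```python
# Hedged sketch of the GPL fine-tuning call. Everything here is a placeholder
# except the general shape of the API (gpl.train from the UKPLab gpl package).
import gpl

gpl.train(
    path_to_generated_data="generated/frc-rules",  # directory containing corpus.jsonl;
                                                   # generated queries/negatives land here too
    base_ckpt="msmarco-distilbert-base-v3",        # off-the-shelf starting model
    batch_size_gpl=32,
    gpl_steps=10_000,                              # placeholder step count
    queries_per_passage=3,                         # placeholder; controls how many questions per rule
    output_dir="output/frc-rules-gpl",             # where the tuned model is saved
    do_evaluation=False,
)
```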
The search server runs Flask and returns the rule number (currently used as a unique ID) of each result. It currently returns the top 5 results.
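A rough sketch of what such an endpoint can look like, assuming Flask and sentence-transformers’ util.semantic_search; the route, model path, and hard-coded rule list below are illustrative only, not the site’s actual code:

```python
from flask import Flask, jsonify, request
from sentence_transformers import SentenceTransformer, util

app = Flask(__name__)
model = SentenceTransformer("output/frc-rules-gpl")  # placeholder path to the GPL-tuned model

# rule_ids[i] is the rule number for corpus_texts[i]; placeholders here
rule_ids = ["G301", "R501"]
corpus_texts = ["...text of G301...", "...text of R501..."]
corpus_embeddings = model.encode(corpus_texts, convert_to_tensor=True)

@app.route("/search")
def search():
    query = request.args.get("q", "")
    query_embedding = model.encode(query, convert_to_tensor=True)
    # top_k=5 mirrors the "top 5 results" behavior described above
    hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=5)[0]
    return jsonify([rule_ids[hit["corpus_id"]] for hit in hits])
```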
Creating the corpus
GPL expects a corpus in the BEIR .jsonl format. It's not very well documented, but I figured out that it expects a file where:
- Each line is a JSON document
- Each JSON document contains the keys:
  - `text`: the document’s text
  - `title`: a title for the document. It does not have to be unique and can be blank
  - `_id`: a unique string id in any format
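To make that concrete, here is a small sketch of how lines in that format can be written; the `rules` list is a hypothetical stand-in for rules parsed out of the game manual:

```python
# Hypothetical corpus writer: `rules` stands in for (rule number, rule text)
# pairs parsed from the manual.
import json

rules = [
    ("G301", "Placeholder rule text..."),
]

with open("corpus.jsonl", "w") as f:
    for rule_number, rule_text in rules:
        doc = {
            "_id": rule_number,  # unique string id (the rule number)
            "title": "",         # can be blank
            "text": rule_text,   # the rule's text
        }
        f.write(json.dumps(doc) + "\n")
```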