Andrew Mitchel LLC

International Tax Blog - New and Interesting International Tax Issues


Semantic Search - Thinking Out Loud

2025-09-19

I have been thinking about training a transformer model for semantic search with a focus on US tax law. I am probably in way over my head, but I am just thinking out loud.

It seems like BERT,[1] or one of its variations,[2] would be a good place to start.

I have heard of TaxBERT. However, TaxBERT seems to be trained on corporate tax disclosures, rather than on US tax law.

One component that I think may be missing in generic pretrained models is the importance of code section numbers. Code sections are critical to understanding US tax law --- I "think" in code sections.

I am not sure about this, but I think many pretrained models give numbers little weight: subword tokenizers often split a number like 166 into arbitrary fragments, and some preprocessing pipelines drop numbers entirely. Thus, a reference in a case or a ruling, for example, to "Code §166" may have no semantic meaning in the model. To me, Code §166 screams "bad debt" expense/deduction. Code §166 may also be integrally related to Code §162 (for trade or business expenses [e.g., business vs. non-business bad debts]). Do most pretrained models just ignore the 166 and the 162? If yes, then they may not make the connection between the two code sections.

I have downloaded over 14,000 Tax Court opinions from the Tax Court website. These would seem like a good corpus to train on (but perhaps too much for my 32GB of RAM and/or my single GPU?). I have converted the PDFs to text files. The data in the text files is a little jumbled because of the placement of footnotes. The footnotes sit at the bottom of each PDF page, so when a page is converted to text, the footnote sentence(s) are simply inserted into the document text, sometimes in the middle of a sentence. I have some ideas for fixing this, but I haven't had a chance to research it yet.
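One possible starting point is a simple heuristic: footnote blocks tend to appear in the bottom portion of a page and begin with a footnote number. A minimal sketch (the regex and the "bottom third of the page" cutoff are assumptions, not tested against real opinions):

```python
import re

def split_footnotes(page_text: str) -> tuple[str, str]:
    """Split one page's text into (body, footnotes).

    Heuristic: the footnote block starts at the first line in the
    bottom third of the page that begins with a one- or two-digit
    footnote number followed by capitalized text, e.g.
    "1 See sec. 166(a).". Real opinions will likely need a more
    robust rule (indentation, font size, separator lines, etc.).
    """
    lines = page_text.splitlines()
    start = len(lines) * 2 // 3  # only scan the bottom third
    for i in range(start, len(lines)):
        if re.match(r"^\s*\d{1,2}\s+[A-Z]", lines[i]):
            return "\n".join(lines[:i]), "\n".join(lines[i:])
    return page_text, ""
```

Collecting the footnotes per page and appending them after the page's body text (or after the full document) would at least keep sentences in the body intact.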

Regarding the code section numbers, I was thinking of ways to help the model understand their importance. First, I would need to identify which numbers are code section numbers and which ones are not. (I already have some scripts that pretty accurately identify which numbers are code section numbers.) I was then thinking of replacing the code section numbers with words ("replacement text"). For example, I might replace a reference to "section 166" with the words "bad debt expense". This way, when training, the model would hopefully capture the semantic meaning of code section numbers.
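The replacement step itself could be a simple lookup-and-substitute pass. A sketch (the gloss table and the citation-pattern regex are illustrative assumptions; the author's existing identification scripts would presumably drive the real matching):

```python
import re

# Hypothetical mapping from code section numbers to descriptive
# phrases; in practice this table would be built out section by
# section from the practitioner's own knowledge.
SECTION_GLOSS = {
    "162": "trade or business expense",
    "166": "bad debt expense",
}

# Matches citation forms like "section 166", "Sec. 166", or "§166".
SECTION_RE = re.compile(r"(?:[Ss]ection|[Ss]ec\.|§)\s*(\d+)")

def gloss_sections(text: str) -> str:
    """Replace recognized code section citations with replacement
    text; leave unrecognized section numbers untouched."""
    def repl(m: re.Match) -> str:
        return SECTION_GLOSS.get(m.group(1), m.group(0))
    return SECTION_RE.sub(repl, text)
```

For example, "a deduction under section 166" would become "a deduction under bad debt expense", while a section not in the table would pass through unchanged.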

I am not yet sure whether the replacement text should carry some special identification so that the model could recognize that the replacement text is special in some way.
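One way to do that would be to wrap each replacement in marker tokens. The marker strings below are made up for illustration; if using a Hugging Face tokenizer, they could be registered as additional special tokens (via `tokenizer.add_special_tokens`) so they survive tokenization as single, atomic tokens:

```python
def mark_replacement(gloss: str) -> str:
    """Wrap replacement text in hypothetical [SEC] ... [/SEC]
    markers so the model can tell substituted text from
    ordinary prose."""
    return f"[SEC] {gloss} [/SEC]"
```

So "section 166" might ultimately become "[SEC] bad debt expense [/SEC]" in the training text.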

If anyone else is thinking about similar topics or working on similar projects, I would love to connect.


  1. Bidirectional Encoder Representations from Transformers ("BERT").

  2. SBERT, RoBERTa, DistilBERT, ModernBERT.

Tags: Python