

I am doing something very similar, but for a different kind of source (PDFs), converting to JSON (JSON vs. YAML does not matter).
What I have done:
- Create a good-enough template. This is very important. I cannot show my template exactly as it is work related, but it is simple: define various key-value pairs and how they are meant to be presented. Something like
{
  // character description
  "name": "NAME_OF_CHARACTER",
  "powers": [{"name": "fly"}, {"name": "see_through_walls"}]
}
and so on. Try to cover as many cases as you can think of.
- Install llama.cpp (Ollama works too). I am using SmolLM3 3B (more knowledge, but slower, 15-18 tps) and Qwen3 1.7B (less knowledge, faster, ~25 tps). I am currently just running everything on my laptop's iGPU.
- Here is my simplified code (I have removed some work-related bits from the prompt, but imagine a detailed prompt asking the model to do something):
# assuming a PDF with a text layer - if it does not have one, we might have to perform OCR first
import sys
import pdftotext
input_file = sys.argv[1]
# Load your PDF
with open(input_file, "rb") as f:
pdf = pdftotext.PDF(f)
pdf_text = "\n\n".join(pdf)
# print(pdf_text)
# reading the jsonc template
with open('./sample-json-skeleton.jsonc', 'r') as f:
template = f.read().strip()
# print(template)
# creating the prompt - we want to ask the model to fit the given pdf_text into the format given by the json template
prompt = "/no_think You have to parse the given text according to the given JSON template. You must not generate false data or alter sentences much, and must try to keep most things verbatim. \n Here is the JSON template. Note that the template currently contains comments, but you should not generate any comments. Stick very closely to the structure of the template, and do not create any new headers. Do not create keys which do not exist in the template. If you find a key or title in the source, try to fit it to the keys/titles from the template. Stick with the format. If you are unable to fit something into the given template, add it to the additional section, as that is the catch-all section. Stick to the template. \n\n``` \n " + template + " \n``` \n\n And here is the data that you have to parse: \n\n``` \n " + pdf_text + " \n```"
# print(prompt)
# asking the llm to parse
# using openai's python lib, but I am not calling openai's servers - instead a locally hosted
# openai-api-compatible server (llama.cpp-server ftw)
from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:11737/", api_key="sk-xxx")
response = client.chat.completions.create(
    model="",  # llama.cpp-server ignores the model name
    messages=[{"role": "user", "content": prompt}],
    temperature=0.4,
)
print(response.choices[0].message.content)
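One thing worth adding after the call: parse and sanity-check the output before trusting it. A minimal sketch (extract_json and invented_keys are my own helper names, not part of the code above): strip any code fences the model may wrap the answer in, json.loads it, and flag any top-level keys the model made up.

```python
import json
import re

def extract_json(raw: str) -> dict:
    # models often wrap output in ``` fences; strip them before parsing
    m = re.search(r"```(?:json)?\s*(.*?)\s*```", raw, re.DOTALL)
    return json.loads(m.group(1) if m else raw.strip())

def invented_keys(parsed: dict, allowed: set) -> list:
    # top-level keys that do not exist in the template
    return [k for k in parsed if k not in allowed]

raw = '```json\n{"name": "ALICE", "powers": [{"name": "fly"}]}\n```'
parsed = extract_json(raw)
extras = invented_keys(parsed, {"name", "powers", "additional"})
```

If parsing fails or extras is non-empty, I would just re-prompt the model with the error appended rather than try to repair the JSON by hand.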
It is not perfect, but it gets me 85%+ of the way there, and it is simple enough. If you need some more help, please ask.
Also, how are you getting the wiki? I would scrape it first. If it is something like Fandom, do not scrape it directly: first host your own BreezeWiki instance (https://docs.breezewiki.com/Running.html), then use wget with a sensible rate limit. Using BreezeWiki removes some junk, so you get cleaner HTML to begin with.
For small models, try to keep the total input (prompt plus data) small, as they generally cannot retain their smarts over long contexts (even if they advertise larger context windows).
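A simple way to enforce that is to chunk the extracted text before building the prompt. A rough sketch (the character budget is a stand-in for a real token count; ~4 chars per token is a common rule of thumb, not exact):

```python
def chunk_text(text: str, max_chars: int = 6000) -> list:
    # split on paragraph boundaries so each chunk stays under a rough character budget
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = current + "\n\n" + para if current else para
    if current:
        chunks.append(current)
    return chunks
```

You would then run the prompt once per chunk and merge the partial JSON results yourself.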