LLM Agent AAR: Recipe Management 
I thought it would be useful to record my journey through various technologies on this blog, akin to Simon Willison's TILs. Since my experience with LLMs has felt more like fighting a battle against their less desirable proclivities, I decided that calling it an AAR (after action report) is probably more apt.
Last I wrote, I was getting deepseek-r1 128B running on my machine and seeing that it had a fractional tokens/sec inference rate. It was clear that using a local LLM in anger would require running smaller and more specialized models. I'm mostly interested in programming tasks, so I wound up choosing the qwen2.5-coder family of models to do further experiments with.
The tasks I have in mind require altering files and generating code, so I did some quick research on VS Code extensions. Although Cline is built around the capabilities of Claude 3.5 Sonnet, I decided to go in that direction anyway, eventually settling on Roo Code, a fork of Cline by a veterinary technology company.
It's pretty easy to configure Roo to use a local ollama service as its provider. I'll note that I did not get around to tinkering with local MCP servers, so I'm not sure what impact that might have on my outcome, but it's on my list of things to pursue.
My goal was to export my list of ~100 recipes from onetsp for import into a self-hosted instance of mealie. How much could the AI do for me? Would it speed up the process?
My first task was to create some ansible scripts for setting up an environment to run mealie on my server. The qwen2.5-coder:14b model was pretty good with ansible, but it still runs somewhat slowly on my computer: ollama ps claimed only ~30% of it was on GPU.
I asked the LLM to make some basic changes like "add docker compose to the list of packages installed in this ansible script", and it completed the change successfully, though it took about 60s. The same was true for "modify the user jmoiron to be in the docker group in this ansible script." These were things that, ultimately, I could do faster myself, even if I would have to look up the documentation. Perhaps better hardware or a smaller model would have made this style of prompting more useful.
Once mealie was running, I ended up having a much bigger task. The dump produced by onetsp had one plain text file per recipe, but mealie can only import recipes in HTML or JSON. Additionally, mealie lacks bulk JSON import, so I'd need to use its API to import my recipes.
To convert the recipes to JSON, I ran something very close to this prompt:
This directory contains text files with recipes in them. For each recipe, eg. "recipe.txt", convert that recipe to JSON using the schema at https://schema.org/Recipe and write the results to "recipe.json". Use a "recipeIngredients" attribute instead of "ingredients" for the recipe ingredients.
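For reference, the output shape I was asking for looks roughly like this. The recipe itself and every field beyond the "recipeIngredients" attribute named in the prompt are my guesses at a plausible https://schema.org/Recipe document, not actual model output:

```python
import json

# A hypothetical converted recipe, loosely following https://schema.org/Recipe.
# "recipeIngredients" is what the prompt asks for explicitly (in place of
# schema.org's "recipeIngredient"); the other fields and the recipe content
# are illustrative guesses.
recipe = {
    "@context": "https://schema.org",
    "@type": "Recipe",
    "name": "Weeknight Dal",
    "recipeIngredients": [
        "1 cup red lentils",
        "1 tbsp ghee",
        "1 tsp cumin seeds",
    ],
    "recipeInstructions": [
        {"@type": "HowToStep", "text": "Rinse the lentils and simmer until soft."},
        {"@type": "HowToStep", "text": "Temper the cumin in ghee and stir it in."},
    ],
}

with open("recipe.json", "w") as f:
    json.dump(recipe, f, indent=2)
```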
After running a more directed version of this prompt and successfully converting one file, I asked it to process the rest of the files. The slow speed of the 14b model meant that each file took about 7-10 minutes, which made iterating on the prompt difficult, but since my earliest attempts at prompting worked, I was content to let it run. Writing the prompt was much faster than writing a script to do the same work, especially since the text files did not have much structure to them.
Unfortunately, after doing ~15 recipes, qwen somehow hallucinated a non-existent filename, and then Roo got stuck in a loop, apologizing for not being able to find the file and then looking for the exact same file again.
I found it very difficult to get the model to skip over text files that already had associated JSON files. Each prompt I tried led the agent to say that everything had already been converted, or that there were no text files in the directory at all.
To get a faster token generation speed and faster prompt iteration, I decided to try out the smaller models. The qwen2.5-coder:3b model fits 100% in video memory and I can run it very quickly, but it was not capable enough to produce any useful work from any of my prompting. It might be better at smaller context tasks like code autocomplete.
The qwen2.5-coder:7b model fared a lot better, and was several times faster than the 14b model on my system. It produced JSON that was slightly wrong for the importer, but seemed right based on the schema. I was able to convince it to correct the files, and it would come up with the right corrections and apply them, so I used it to finish out the last ~10 files that the 14b model had left untouched after its hallucination loop.
This whole process took ~2 days, but most of that was computation time during which I was doing other things. In terms of hands-on work, prompting, trying to coax it to skip files correctly, and moving things around, it was maybe an hour. I probably came out slightly ahead compared to writing a script to do the conversion, but only because I had 100 recipes rather than 1000.
Next time I have a task like this, I would probably approach it differently. If I had a lot of files, and the conversion could reasonably be done without the advanced fuzzy reasoning of the AI model, I'd ask the AI to help me write a script to do the conversion. Even a terribly inefficient script would produce the output in a couple seconds.
Alternatively, I'd drive the AI from an external batch job, asking it to do a single targeted conversion at a time, so it can't get caught up in a hallucination and fail to make progress; a sketch of what I mean follows.
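A minimal sketch of that approach, assuming ollama's local REST API on its default port, with a placeholder model name and prompt: a plain script owns the file loop and the skip-if-already-converted check, and the model only ever sees one recipe at a time.

```python
import json
from pathlib import Path

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # ollama's local REST API
MODEL = "qwen2.5-coder:7b"  # whichever model is pulled locally

PROMPT = (
    "Convert the following recipe to JSON using the schema at "
    "https://schema.org/Recipe. Use a \"recipeIngredients\" attribute instead "
    "of \"ingredients\" for the recipe ingredients. Reply with only the JSON.\n\n"
    "{recipe}"
)

for txt in sorted(Path(".").glob("*.txt")):
    out = txt.with_suffix(".json")
    if out.exists():  # the skip logic the agent kept getting wrong
        continue
    resp = requests.post(
        OLLAMA_URL,
        json={
            "model": MODEL,
            "prompt": PROMPT.format(recipe=txt.read_text()),
            "stream": False,
            "format": "json",  # ask ollama to constrain the reply to JSON
        },
        timeout=600,
    )
    resp.raise_for_status()
    reply = resp.json()["response"]
    # If the model produced something unparseable, note it and move on rather
    # than letting one bad file stall the whole batch.
    try:
        out.write_text(json.dumps(json.loads(reply), indent=2))
        print(f"converted {txt} -> {out}")
    except json.JSONDecodeError:
        print(f"could not parse model output for {txt}, skipping")
```

The point is that a hallucinated filename can't derail anything here: the script, not the model, decides which file comes next.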
For the import script, rather than have the AI generate something for me, I used a mostly-working script I had found in a discussion on mealie's GitHub about its lack of bulk JSON import.
I wanted to add a little argparse interface to it so I could specify a single recipe to upload, which I asked the LLM to do, but it wanted to use its diff_apply tool to modify the code, and it kept producing a diff against a python "hello, world" rather than the actual code. At each failure, it would realize that it should use a different method to modify the file, but it'd then generate a diff and fail again in perpetuity.
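For what it's worth, the interface I wanted is only a few lines. The sketch below is my own stand-in for the script from that discussion, not the real thing: the /api/recipes path, the MEALIE_TOKEN environment variable, and the default port are assumptions, so check your instance's API docs before trusting any of it.

```python
import argparse
import json
import os
from pathlib import Path

import requests


def main() -> None:
    parser = argparse.ArgumentParser(description="Upload a single recipe JSON to mealie")
    parser.add_argument("recipe", type=Path, help="path to a converted recipe .json file")
    parser.add_argument("--base-url", default="http://localhost:9000",
                        help="base URL of the mealie instance")
    args = parser.parse_args()

    # Hypothetical endpoint and auth: consult your instance's API docs for the
    # actual recipe-creation route and whether it expects a bearer token.
    token = os.environ["MEALIE_TOKEN"]
    resp = requests.post(
        f"{args.base_url}/api/recipes",
        headers={"Authorization": f"Bearer {token}"},
        json=json.loads(args.recipe.read_text()),
        timeout=30,
    )
    resp.raise_for_status()
    print(f"uploaded {args.recipe.name}: {resp.status_code}")


if __name__ == "__main__":
    main()
```

Usage would then be along the lines of `MEALIE_TOKEN=... python upload_recipe.py recipe.json`, which makes retrying a single failed upload cheap.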
I don't know that I have conclusions to draw from this experience, because nothing feels particularly conclusive. I still feel like I'm stumbling around in the dark, but even this slowly dissipates the fog of war.
One of my primary goals in exploring this space is to build an intuition for what these things are good at and what they're bad at, and I think it's starting to come into focus. There's enough space within the envelope of my ignorance that I could be on the wrong track, and the pace of development could invalidate the things I have right, but you have to start somewhere.
At this point, my understanding is that LLMs are very good with nuance and uncertainty, and not very good at tasks with high degrees of certainty, like enumeration. Fortunately, this makes them good at things I don't really know how to program well, like describing an image or summarizing some text, and bad at things that I know how to program easily, like batch processing jobs or using an API.
I have been pretty underwhelmed thus far with their abilities as a helper for actual programming, though I do have a particular project where I am going to lean into that a bit more and see if I can get it to work for me.