Generating Recipes
Explore the research related to generating recipes with deep learning.
As eating habits and culinary practices have evolved, with more meals coming from third-party sources such as takeaways and restaurants, obtaining detailed information about prepared food has become increasingly challenging.
Behind each meal there is a story described in a complex recipe, but unfortunately, we cannot recover its preparation process simply by looking at an image of the dish. This limitation underscores the need for an inverse cooking system capable of deriving the ingredients and cooking instructions from a picture of a prepared meal. This is where generative models come in.
One advanced application of generating textual descriptions of images using GANs is producing a structured description with multiple components, such as generating a recipe for the dish shown in a food image.
As shown in the figure above, this kind of description is also more complex because its components (the instructions) must follow a particular sequence to be coherent.
Generating recipes from images
Generating a recipe (including its title, list of ingredients, and preparation instructions) from an image is challenging. It necessitates a comprehensive grasp of the ingredients and the various processes they undergo, such as slicing, blending, or mixing.
One way to achieve this is to avoid generating the recipe from the image directly and instead add an intermediate step that predicts the list of ingredients. The sequence of instructions is then generated conditioned on both the image and its predicted ingredient list, where the interplay between the two provides additional insight into how the ingredients were processed to produce the final dish. This process is called inverse cooking: recreating a cooking recipe from a food image.
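To make this two-stage idea concrete, here is a minimal sketch in PyTorch. The callables `ingredient_predictor` and `instruction_generator` are hypothetical placeholders for the two stages, and thresholding sigmoid scores is just one plausible way to turn ingredient predictions into a set; this is not the exact published system.

```python
import torch

def inverse_cooking(image, ingredient_predictor, instruction_generator):
    """Sketch of the two-stage pipeline: predict ingredients first, then
    decode instructions conditioned on both the image and the ingredients."""
    with torch.no_grad():
        # Stage 1: multi-label ingredient prediction from the image alone.
        ingredient_logits = ingredient_predictor(image)           # (B, num_ingredients)
        ingredients = torch.sigmoid(ingredient_logits) > 0.5      # predicted ingredient set

        # Stage 2: instruction generation conditioned on image + ingredients,
        # so the model can reason about how the ingredients were processed.
        instructions = instruction_generator(image, ingredients)  # token ids, (B, T)

    return ingredients, instructions
```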
Inverse cooking
As the figure below demonstrates, the inverse cooking problem has also been studied using generative models that condition the generated recipe on two embeddings (both are sketched in code after this list):
One embedding represents visual features extracted from an image.
The second one encodes the ingredients extracted from the image.
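To make these two conditioning signals concrete, the sketch below builds them with standard PyTorch components. The ResNet-50 backbone, the 512-dimensional embedding size, and the ingredient vocabulary size are illustrative assumptions rather than the exact encoders used in the published model.

```python
import torch
import torch.nn as nn
import torchvision.models as models

embed_dim = 512

# Embedding 1: visual features extracted from the image by a CNN backbone,
# projected to the decoder's embedding size.
image_encoder = models.resnet50(weights=None)
image_encoder.fc = nn.Linear(image_encoder.fc.in_features, embed_dim)

# Embedding 2: learned vectors for the ingredients detected in the image.
num_ingredients = 1500                                       # assumed vocabulary size
ingredient_encoder = nn.Embedding(num_ingredients, embed_dim)

image = torch.randn(1, 3, 224, 224)                          # a dummy food image
ingredient_ids = torch.tensor([[12, 87, 403]])               # made-up ids for three ingredients

visual_embedding = image_encoder(image)                      # (1, 512)
ingredient_embeddings = ingredient_encoder(ingredient_ids)   # (1, 3, 512)
```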
The instruction decoder is composed of transformer blocks, each of them containing two attention layers—a technique to improve the performance of models by focusing on relevant information—followed by a linear layer. The first attention layer applies self-attention, which captures dependencies among the instruction tokens generated so far, while the second attends over the image and ingredient embeddings so that each generated word is conditioned on the dish itself.
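A minimal version of one such block is sketched below, assuming the image and ingredient embeddings have been concatenated into a single conditioning sequence; the layer sizes and the placement of layer normalization are illustrative choices, not the exact configuration of the original decoder.

```python
import torch
import torch.nn as nn

class InstructionDecoderBlock(nn.Module):
    """One transformer block: self-attention over the words generated so far,
    attention over the conditioning (image + ingredient) embeddings,
    and a final position-wise linear (feed-forward) layer."""

    def __init__(self, embed_dim=512, num_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.cond_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.linear = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim),
            nn.ReLU(),
            nn.Linear(4 * embed_dim, embed_dim),
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.norm3 = nn.LayerNorm(embed_dim)

    def forward(self, tokens, conditioning, causal_mask=None):
        # 1) Self-attention: each position attends to previously generated tokens.
        x, _ = self.self_attn(tokens, tokens, tokens, attn_mask=causal_mask)
        tokens = self.norm1(tokens + x)
        # 2) Conditioning attention: attend over image + ingredient embeddings.
        x, _ = self.cond_attn(tokens, conditioning, conditioning)
        tokens = self.norm2(tokens + x)
        # 3) Position-wise linear layer with a residual connection.
        return self.norm3(tokens + self.linear(tokens))
```

A complete decoder would stack several of these blocks and project the final hidden states onto the recipe vocabulary with a linear output layer to produce each instruction word.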