SMU Office of Research and Tech Transfer – The steak frites arrives. The diner scoops half the fries onto a side plate, replaces them with a small roll, and then takes a picture of the food. The food recognition app shows 37g of carbs. The diner, who is diabetic, enters the carbs into his insulin pump and gives himself a bolus to cover the meal. With the app, he is able to keep his blood sugar levels tight and diabetic complications at bay.
The beauty of the app is that it does not rely on manual data entry for food monitoring and nutrition analysis. Just snap a photo to know what you are eating and maintain a healthy lifestyle. That is the area of research Ngo Chong Wah, Professor of Computer Science at SMU School of Computing and Information Systems, is involved in.
“My main research topic is ‘multimedia’, where one of the core problems is how to ‘conquer the semantic gap’. Very simply put, I try to make the computer understand the content of the image/video which is simply a bunch of pixel values,” he tells the Office of Research and Tech Transfer.
How it all began
Professor Ngo started working on this research topic when he was an assistant professor at the City University of Hong Kong in 2002. At that time, TRECVid (https://trecvid.nist.gov/) initiated an annual video benchmarking activity for researchers working in this area to evaluate and compare their algorithms on large video datasets.
“I found the series of tasks, such as annotating video content with English words and retrieving video shots for queries written in English, challenging and meaningful, and have been working on them since,” he says.
Professor Ngo recalls how, when deep learning came about in 2012, he was very impressed by its ability to recognise image content. "It was not just better, but so much better, than any of the results we'd seen."
At that time, he was also inspired by Japanese researchers who started companies for food image recognition. "It sounded amazing to me that we could quantify what we eat by taking pictures of food. I started to work on applying deep learning to recognising ingredients and cooking methods from food images," he adds. The results were very encouraging, and he and his students won several research awards, including the Best Student Paper prizes at the 2016 ACM (Association for Computing Machinery) Multimedia and 2017 MultiMedia Modeling conferences.
Cross-domain and cross-modal food transfer
Recent work in cross-modal image-to-recipe retrieval paves a new way to scale up food recognition, says Professor Ngo. ACM Multimedia recently published his study in that area, ‘Cross-domain Cross-modal Food Transfer’.
According to Professor Ngo, cross-modal retrieval simply means that the input and the retrieved output are in different modalities. For example, the input could be an image and the output a text description (e.g., a recipe). “Let’s suppose we query an image of a dish. The system then retrieves the recipe of the dish. Some ingredients are not visible in the image. However, the recipe provides basic information such as how much oil and sugar were used for cooking, which is useful for calorie estimation.”
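As a toy sketch of this idea, retrieval reduces to a nearest-neighbour search in a shared vector space. The dish names and hand-picked vectors below are hypothetical stand-ins for what real image and recipe encoders would produce; they are not from the paper.

```python
import numpy as np

# Hypothetical pre-computed embeddings: in a real system an image encoder
# and a recipe text encoder would map both modalities into the same vector
# space; here we hard-code small vectors for illustration only.
recipe_db = {
    "steak frites": np.array([0.9, 0.1, 0.0]),
    "laksa":        np.array([0.1, 0.8, 0.3]),
    "sushi":        np.array([0.0, 0.2, 0.9]),
}

def retrieve_recipe(image_embedding):
    """Return the recipe whose embedding is closest by cosine similarity."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(recipe_db, key=lambda name: cosine(image_embedding, recipe_db[name]))

# A query image whose (hypothetical) embedding lands near "steak frites".
query = np.array([0.85, 0.15, 0.05])
print(retrieve_recipe(query))  # -> steak frites
```

Cosine similarity is a common choice here because only the direction of the embedding, not its magnitude, should determine the match.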
Cross-domain transfer is about using very few training examples to re-train a recognition engine to recognise food from a different cuisine. “In other words, if there is software that recognises Chinese food, how can the software be modified and trained to recognise Malay food? Cross-domain retrieval means ‘I use my knowledge about Chinese food to recognise Malay food’,” he clarifies.
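One common way to realise such a transfer with very few examples, sketched here under assumed data (the `backbone`, dish names, and feature vectors are all illustrative, not from the paper), is to freeze the already-learned feature extractor and fit only a lightweight classifier on the new cuisine:

```python
import numpy as np

# Hypothetical setup: a backbone trained on Chinese food turns any food
# image into a feature vector. To transfer to Malay food, we keep the
# backbone frozen and fit only a tiny nearest-centroid classifier on a
# handful of labelled Malay examples (few-shot transfer).
rng = np.random.default_rng(0)

def backbone(image):
    # Stand-in for the frozen feature extractor: here the "images" are
    # already feature vectors, so this is the identity function.
    return image

# Five labelled examples per Malay dish -- the "very few" training examples.
support = {
    "nasi lemak": rng.normal([1.0, 0.0], 0.1, size=(5, 2)),
    "satay":      rng.normal([0.0, 1.0], 0.1, size=(5, 2)),
}
centroids = {dish: backbone(x).mean(axis=0) for dish, x in support.items()}

def classify(image):
    f = backbone(image)
    return min(centroids, key=lambda d: np.linalg.norm(f - centroids[d]))

print(classify(np.array([0.9, 0.1])))  # close to the nasi lemak centroid
```

The design choice is that all the expensive learning stays in the frozen backbone; only per-dish centroids are estimated from the few new examples.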
The research paper addresses the challenge of resource scarcity in the scenario where only partial data, rather than a complete view of the data, is accessible for model transfer, Professor Ngo says. Partial data refers to missing information, such as the absence of the image modality or of cooking instructions from an image-recipe pair.
To cope with partial data, a novel generic model, equipped with various loss functions including cross-modal metric learning, recipe residual loss, semantic regularisation, and adversarial learning, is proposed for cross-domain transfer learning. Experiments were conducted on three different cuisines (Chuan川 or Szechuan, Yue粤 or Cantonese, and Washoku和食 or Japanese) to provide insights on scaling up food recognition across domains with limited training resources.
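A minimal sketch of how such a multi-term objective is typically combined into one training loss, using toy two-dimensional embeddings: only the cross-modal metric-learning (triplet) term is worked out, and the other three terms are placeholder inputs rather than the paper's actual formulations.

```python
import numpy as np

def triplet_loss(img, pos_recipe, neg_recipe, margin=0.3):
    """Pull the matching recipe closer to the image than a non-match,
    by at least `margin` (a standard cross-modal metric-learning term)."""
    d_pos = np.linalg.norm(img - pos_recipe)
    d_neg = np.linalg.norm(img - neg_recipe)
    return max(0.0, d_pos - d_neg + margin)

def total_loss(img, pos, neg, residual=0.0, semantic=0.0, adversarial=0.0,
               weights=(1.0, 0.1, 0.1, 0.1)):
    # The recipe-residual, semantic-regularisation, and adversarial terms
    # are hypothetical stand-ins here; the weights are illustrative, too.
    terms = (triplet_loss(img, pos, neg), residual, semantic, adversarial)
    return sum(w * t for w, t in zip(weights, terms))

img      = np.array([1.0, 0.0])
matching = np.array([0.9, 0.1])   # embedding of the true recipe
mismatch = np.array([0.0, 1.0])   # embedding of some other recipe
print(total_loss(img, matching, mismatch))
# -> 0.0 (the match is already closer than the mismatch by more than the margin)
```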
Challenges and applications
The first challenge that researchers try to address in food recognition is scalability. “Recognising a couple of hundred food images is not a problem now, but scaling up to thousands of images remains very challenging,” says Professor Ngo.
Another challenge is that model training requires a large number of training examples, which are difficult to collect because every food image must be manually labelled with its ingredients. So researchers look for ways to train a food recognition model using fewer training examples.
The third challenge is cross-domain and cross-lingual transfers. The food datasets constructed by researchers from each country are different in terms of cuisine and language used in the recipes. Consequently, a model trained using western food will yield poor recognition results on Chinese food, for example. So, research effort is ongoing to study how to train a ‘universal model’ that can recognise different cuisines and retrieve recipes written in different languages.
To overcome these challenges, Professor Ngo says that instead of collecting and labelling food images as training examples, they used paired information (image-recipe pairs) for model training. “Specifically, instead of labelling the ingredients in a food image for training, the model is fed a recipe that describes the cooking process of the corresponding food. The model is trained to ‘self-learn’ which image region corresponds to which ingredient, and which appearance in an image corresponds to which cooking action (e.g., cut, slice, dice) in the recipe.”
The idea is to train a model to describe the cooking process of food (e.g., the ingredients used and how they are cut or cooked to produce a dish) with a feature representation that is generic to different languages and cuisines. If this is doable, adds Professor Ngo, food recognition can be achieved by retrieving recipes to quantify the food content in an image.
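Once the matching recipe has been retrieved, quantifying the food content reduces to summing over its ingredient list. A short sketch with made-up nutrition values and quantities (illustrative only, not real data):

```python
# Hypothetical calorie table (kcal per 100 g) and a retrieved recipe's
# ingredient list with gram quantities -- all values are made up.
CALORIES_PER_100G = {"beef": 250, "potato": 77, "butter": 717}

recipe = [("beef", 200), ("potato", 150), ("butter", 20)]  # (ingredient, grams)

total = sum(CALORIES_PER_100G[name] * grams / 100 for name, grams in recipe)
print(f"Estimated calories: {total:.0f} kcal")  # prints "Estimated calories: 759 kcal"
```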
What Professor Ngo wants to achieve in multimedia computing is to make multimedia search as easy as text or Google search. “Search is a basic function for any application, but search for multimedia data remains challenging especially when the images/videos are not accompanied by text descriptions,” he says.
His research interests lie in bridging the gap between machine, user, and data. He is interested in studying algorithms that allow humans and machines to engage efficiently to find a target in a large database, what researchers like to refer to as “finding a needle in a haystack”.
The research topics are application-driven, and he says that it is always interesting to work on something the industry has not thought about. “I am excited to see my students being recruited by IT companies or research labs because they worked on topics that are new and relevant to industry during their PhD years.”
Source: Research@SMU May 2021 Issue
Last updated on 23 Jun 2021.