MIT Researchers Develop A New Machine Learning Model That Understands The Underlying Relationships Between Objects In A Scene – MarkTechPost


Source: https://arxiv.org/pdf/2111.09297.pdf

Deep learning models do not see the world the same way we humans do. Humans have the ability to see objects and interpret their relationships. However, DL models don’t comprehend the complex interactions between individual objects. 

To address this issue, researchers from MIT have developed a machine learning approach that understands the underlying relationships between objects in a scene. The model first represents individual relationships one at a time and then combines them to describe the complete scene.

The new framework creates an image of a scene based on a text description of objects and their connections, such as “A wood table to the left of a blue stool.” 

To accomplish this, the framework first breaks the description down into smaller pieces, one for each individual relationship (for example, "a wood table to the left of a blue stool"), and models each piece independently. An optimization procedure then assembles the pieces into a single scene image. Even when the scene contains several objects in varying relationships with one another, the model can generate accurate images from the text description.
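As a rough illustration of the decomposition step, the sketch below splits a compound description into one clause per relationship; the splitting rule (clauses joined by "and") is a hypothetical simplification, not the researchers' actual parser.

```python
# Minimal sketch (not the authors' code): split a compound scene description
# into single-relation clauses that can be modeled independently.
def split_into_relations(description: str) -> list[str]:
    """Assume clauses are joined by 'and'; return one clause per relation."""
    return [clause.strip() for clause in description.split(" and ")]


clauses = split_into_relations(
    "a wood table to the left of a blue stool and a red couch behind the table"
)
print(clauses)
# ['a wood table to the left of a blue stool', 'a red couch behind the table']
```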

Most DL systems consider all relationships holistically and generate the image from the description in a single step. However, these models fail on out-of-distribution descriptions, such as those containing more relations than seen during training, because a single-shot generator cannot cope with the additional relationships. According to the researchers, their model can handle a larger number of relations and adapt to unexpected combinations because it composes these separate, smaller models.

To capture the individual object relationships in a scene description, the researchers employed a machine-learning technique called energy-based models. The method encodes each relational description with its own energy-based model and then composes them, jointly inferring all objects and relationships.
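The sketch below, written with PyTorch, illustrates the general idea of composing energy-based models: each relation clause gets an energy score for a candidate image, the scores are summed, and an image is generated by Langevin-style gradient descent on that sum. The network architecture, embedding size, and sampling hyperparameters here are placeholders for illustration, not the paper's settings.

```python
import torch
import torch.nn as nn


class EnergyNet(nn.Module):
    """Scores how well an image matches one relation embedding (lower = better match)."""

    def __init__(self, img_dim=3 * 64 * 64, cond_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim + cond_dim, 256),
            nn.SiLU(),
            nn.Linear(256, 1),
        )

    def forward(self, img, cond):
        # img: (B, 3, 64, 64), cond: (B, cond_dim) -> energy: (B, 1)
        return self.net(torch.cat([img.flatten(1), cond], dim=1))


def compose_and_sample(energy_fn, relation_embs, steps=60, step_size=0.01, noise=0.005):
    """Langevin-style gradient descent on the summed energies of all relations."""
    img = torch.rand(1, 3, 64, 64, requires_grad=True)
    for _ in range(steps):
        total_energy = sum(energy_fn(img, e) for e in relation_embs).sum()
        grad, = torch.autograd.grad(total_energy, img)
        with torch.no_grad():
            img = img - step_size * grad + noise * torch.randn_like(img)
        img.requires_grad_(True)
    return img.detach().clamp(0, 1)


# Usage: one (randomly initialized, hypothetical) embedding per relation clause.
energy_fn = EnergyNet()
relation_embs = [torch.randn(1, 128) for _ in range(2)]  # e.g. two relations
sample = compose_and_sample(energy_fn, relation_embs)
print(sample.shape)  # torch.Size([1, 3, 64, 64])
```

Because the composed energy is just a sum, adding another relation at test time only means adding another term, which is the intuition behind the model's ability to generalize to descriptions with more relations than it saw during training.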

Because descriptions are broken down into shorter pieces, one per relationship, the system can also recombine them in a variety of ways, making it better able to adapt to scene descriptions it hasn't seen before.

What's more interesting is that the method can also work in reverse: given an image, it can find text descriptions that match the relationships between the objects in the scene. Furthermore, the model can edit an image by rearranging the scene's elements to satisfy a new description.
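Continuing the hypothetical sketch above (and reusing its `energy_fn` and per-clause embeddings), the reverse direction can be pictured as scoring each candidate description against an image by summing the energies of its clauses and picking the lowest-energy, i.e. best-matching, one.

```python
def best_description(energy_fn, image, candidates):
    """candidates: list of (text, list_of_relation_embeddings) pairs (hypothetical)."""
    with torch.no_grad():
        scores = {
            text: sum(energy_fn(image, e) for e in embs).item()
            for text, embs in candidates
        }
    return min(scores, key=scores.get)
```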

When compared against other DL models that were given text descriptions and asked to generate images depicting the objects and their relationships, the proposed model outperformed the baselines in every case.

Humans were also …….

Source: https://www.marktechpost.com/2021/12/01/an-mit-research-develops-a-new-machine-learning-model-that-understands-the-underlying-relationships-between-objects-in-a-scene/