A deep dive into the Segment Anything model's decoding process, with a focus on how its self-attention and cross-attention mechanisms work.
The Segment Anything Model (SAM) is a 2D interactive segmentation model, also called a guided or promptable model. SAM requires user prompts to segment an image: the prompts tell the model where to focus and what to segment. The inputs to SAM are a 2D image and a set of user prompts; the output is a set of segmentation masks at different levels of granularity, each with an associated confidence score.
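To make this input/output contract concrete, here is a minimal sketch using Meta's open-source segment-anything package; the checkpoint path, the blank placeholder image, and the click coordinates are assumptions for illustration only:

```python
# pip install segment-anything; checkpoint from the segment-anything repo (placeholder path).
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

image = np.zeros((768, 1024, 3), dtype=np.uint8)  # stand-in for a real HxWx3 RGB image
predictor.set_image(image)

# One foreground click at pixel (x=500, y=375); label 1 = foreground, 0 = background.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,  # return masks at different levels of granularity
)
print(masks.shape, scores.shape)  # (3, 768, 1024) masks and (3,) confidence scores
```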
A segmentation mask is a 2D binary array with the same size as the input image. In this array, the entry at location (x, y) is 1 if the model thinks the pixel at location (x, y) belongs to the segmented region, and 0 otherwise. The confidence scores indicate the model's belief in the quality of each segmentation: a higher score means higher quality.
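As a toy illustration of this output format (the mask values and scores below are made up, not real model output):

```python
import numpy as np

# Three candidate masks for a 4x4 "image", plus one confidence score per mask.
masks = np.zeros((3, 4, 4), dtype=bool)
masks[0, 1:3, 1:3] = True                 # mask 0 segments a 2x2 central region
scores = np.array([0.91, 0.55, 0.40])     # predicted quality of each mask

best = int(scores.argmax())               # higher score means higher quality
print(masks[best].astype(int))            # 1 where the pixel belongs to the segment
print("pixels segmented:", masks[best].sum())
```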
The network architecture of SAM consists of an encoder and a decoder:
- The encoder takes in the image and the user prompt inputs to produce an image embedding, an image positional embedding, and user prompt embeddings.
- The decoder takes in these embeddings to produce the segmentation masks and confidence scores, as shown in the sketch after this list.
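As a rough sketch of this data flow, the snippet below calls SAM's encoder and decoder modules directly via the segment-anything package; the checkpoint path and the blank placeholder image are assumptions, and the shapes shown are for the default ViT-B configuration with a 1024x1024 input:

```python
import torch
from segment_anything import sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth").eval()

with torch.no_grad():
    # Encoder side: image embedding, image positional embedding, prompt embeddings.
    image = torch.zeros(1, 3, 1024, 1024)            # preprocessed RGB image (placeholder)
    image_embedding = sam.image_encoder(image)       # (1, 256, 64, 64)
    image_pe = sam.prompt_encoder.get_dense_pe()     # (1, 256, 64, 64)

    coords = torch.tensor([[[500.0, 375.0]]])        # one foreground click
    labels = torch.tensor([[1]])
    sparse, dense = sam.prompt_encoder(points=(coords, labels), boxes=None, masks=None)

    # Decoder side: embeddings in, masks and confidence (IoU) scores out.
    low_res_masks, iou_scores = sam.mask_decoder(
        image_embeddings=image_embedding,
        image_pe=image_pe,
        sparse_prompt_embeddings=sparse,
        dense_prompt_embeddings=dense,
        multimask_output=True,
    )
    print(low_res_masks.shape, iou_scores.shape)     # (1, 3, 256, 256) and (1, 3)
```

Note that the decoder predicts low-resolution masks; the library upsamples them back to the original image size as a post-processing step.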