Types of image annotation — the ultimate guide

Humans in the Loop
8 min read · Dec 9, 2019

Wondering what kind of image annotation would be most appropriate for your project?

Image annotation is one of the main specialties of our humans in the loop, and we are happy to share some insights on the pros and cons of each type: from simple vector shapes like bounding boxes to full pixel-wise semantic segmentation.

Bear in mind that this guide covers only the visual annotation types, but additional metadata can be assigned on the image level or on the object level. These include strings, numbers, booleans, single or multiple choice selections, etc. For example, a bounding box for cars in a scene can include tags such as color, make, license plate number, etc.

1. Bounding box

Annotation with a bounding box means that a rectangle (very rarely an actual box) has to be drawn tightly around the edges of each object. The output is usually the coordinates of x min/y min and x max/y max, but this can vary a bit depending on the tool. The goal is to help us detect and recognize the different classes of objects (e.g. “plate” 🍽️, “fork” 🍴, “cupcake” 🧁).
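As a rough sketch of what that output can look like in practice (the field names below are illustrative assumptions, not the schema of any particular tool):

```python
# A minimal sketch of a single bounding-box annotation; field names and values
# are illustrative assumptions, not any specific tool's export format.
cupcake_box = {
    "label": "cupcake",
    "x_min": 212,  # left edge, in pixels
    "y_min": 145,  # top edge
    "x_max": 298,  # right edge
    "y_max": 240,  # bottom edge
    "attributes": {"flavor": "vanilla"},  # optional object-level metadata, as mentioned above
}

# Width and height follow directly from the corner coordinates.
width = cupcake_box["x_max"] - cupcake_box["x_min"]
height = cupcake_box["y_max"] - cupcake_box["y_min"]
```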

Pros: quite easy and quick to draw

Cons: depending on the type and position of the object, the box might include a lot of pixels that do not belong to it (background or other objects around it, such as in the case of the cupcake and fork next to each other). The Rotated Rect format (which also includes the degree of rotation of the box in addition to the x and y coordinates) could be helpful when objects are placed diagonally, like the fork in the example 🥄.
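A hedged sketch of that rotated variant, which essentially just adds an angle (again, illustrative names and values):

```python
# A sketch of a Rotated Rect annotation: the box plus its rotation in degrees.
# Field names and values are illustrative assumptions.
fork_box = {
    "label": "fork",
    "x_center": 240,
    "y_center": 310,
    "width": 160,
    "height": 28,
    "angle_degrees": 35,  # rotation of the box around its center
}
```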

Rating: ⭐️⭐️

2. Polygon

Polygons are a much more precise way to annotate objects, since they include only the pixels that belong to them, but they are a bit slower to draw. The output is usually a sequence of x, y coordinates for every point that comprises the polygon. A donut detector trained with polygons rather than boxes would arguably work much better.

Pros: more precise than boxes

Cons: the tool you use has to support holes in order to work on objects such as donuts 🍩 or pretzels 🥨 like in the example. The output in this case is split into exterior and interior coordinates. When polygons overlap, bear in mind what the order of objects is and which polygon should appear on top.
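A sketch of what a polygon with a hole might look like when the output is split into exterior and interior rings as described above (field names and coordinates are illustrative assumptions):

```python
# A sketch of a polygon annotation with one hole (e.g. a donut); the exterior/interior
# split follows the convention described above, but the names are illustrative.
donut_polygon = {
    "label": "donut",
    "exterior": [(120, 80), (180, 95), (200, 160), (150, 210), (95, 170), (90, 110)],
    "interior": [  # one list of points per hole
        [(140, 130), (160, 135), (158, 158), (138, 155)],
    ],
}
```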

Rating: ⭐️⭐️⭐️

3. Point

Point annotation (also known as landmark annotation) is very useful for pinpointing single pixels in an image, as well as for counting objects as in the example with the cherries 🍒. The output is usually a single x, y coordinate but points can also have a bigger diameter.
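For counting, the whole annotation can be as small as a list of labeled coordinates (a sketch with made-up values):

```python
# A sketch of point (landmark) annotations used for counting; values are illustrative.
cherries = [
    {"label": "cherry", "x": 34, "y": 120},
    {"label": "cherry", "x": 78, "y": 131},
    {"label": "cherry", "x": 112, "y": 98},
]
cherry_count = len(cherries)  # counting objects is just counting points
```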

Pros: easiest to draw and could be drawn even with very simple tools like Paint as long as the color of the point is unique. Very useful for counting objects in a bunch, such as people in a crowd

Cons: requires utmost precision and maximum zoom in order to pinpoint the exact pixels in the image

Rating: ⭐️⭐️

4. Keypoint

When we want to track the shapes of specific objects, using keypoints is a great way to compare the nodes and edges that compose the skeleton. This is most easily applicable to human figures or facial features, but in our example it is shown with pears 🍐: each pear has 10 consecutive keypoints that mark its main features, and we can adjust them to the shape of each instance. The output is the x, y coordinates of the numbered keypoints, always in the same order.
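A sketch of one annotated instance, assuming a fixed 10-point skeleton as in the pear example (the coordinates are made up for illustration):

```python
# A sketch of a keypoint (skeleton) annotation: the points always come in the same
# order, so point 1 marks the same feature on every pear. Coordinates are illustrative.
pear_keypoints = {
    "label": "pear",
    "keypoints": [  # 10 (x, y) pairs in a fixed, numbered order
        (210, 40), (214, 62), (220, 85), (228, 110), (234, 135),
        (230, 160), (220, 182), (205, 198), (188, 205), (172, 208),
    ],
}

# Because the order is fixed, keypoint i of one pear can be compared directly
# with keypoint i of another to study differences in shape.
```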

Pros: once the skeleton is set up in the annotation tool, the only thing needed is to adjust the nodes, which makes annotation much easier. It also helps annotators not to forget any node from the sequence, which is a possibility if they have to insert them manually each time

Cons: objects must have the same regular structure in order for this annotation type to be useful

Rating: ⭐️⭐️⭐️⭐️

5. Polyline

Line annotation is great when we need to trace a shape that doesn't have to start and end in the same place (as is the case with polygons). When lines are composed of multiple points, each one recorded with its x, y coordinates, we talk about a polyline. The most common use case is lane markings on roads, but to continue with the food examples, we're showcasing how to compare the curve of the buns of different hamburgers 🍔.
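A sketch of a polyline annotation, an open chain of points rather than a closed ring (field names and coordinates are illustrative assumptions):

```python
# A sketch of a polyline annotation: an open sequence of x, y points that does not
# close back on itself. Field names and values are illustrative assumptions.
bun_curve = {
    "label": "bun_top",
    "points": [(40, 180), (80, 140), (130, 120), (180, 125), (230, 150), (260, 185)],
}
```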

Pros: very useful for specific use cases that need to trace a line or a curve (wire detection, lane detection, etc.)

Cons: only works when the line in the image is close to 1 pixel wide. If it's wider, a polygon might be needed to account for the width

Rating: ⭐️⭐️

6. Circle

When objects are perfectly circular, circle annotation might be really appropriate because it will save you a lot of time compared to having to draw circles by hand or with a polygon. In the case of these eggs 🥚🥚🥚, annotating them with a polygon would have been much less precise than using the circle.
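Since a circle is fully described by a center and a radius, the annotation can be as compact as this (illustrative field names):

```python
# A sketch of a circle annotation: one center point plus a radius (illustrative names).
egg = {"label": "egg", "center_x": 150, "center_y": 150, "radius": 42}
```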

Pros: works great for annotating objects like balls, coins, traffic signs, bottle caps or anything else with a circular shape

Cons: if the objects are not perfectly circular, this type of annotation might not be that useful because it will not fit very well

Rating: ⭐️⭐️

7. Ellipse

When the target objects are not perfectly circular but they still have uniform shapes, it would make sense to use the ellipse shape. For example, after the eggs from the previous example have been cooked 🍳, you can use the ellipse to annotate them precisely.
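An ellipse needs two radii and a rotation angle instead of a single radius, something along these lines (field names and values are illustrative assumptions):

```python
# A sketch of an ellipse annotation: a center, two semi-axes and a rotation angle.
# Field names and values are illustrative assumptions.
fried_egg = {
    "label": "fried_egg",
    "center_x": 220,
    "center_y": 140,
    "radius_x": 95,        # half the width of the ellipse
    "radius_y": 60,        # half the height
    "angle_degrees": 15,   # the "direction" the ellipse is rotated in
}
```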

Pros: easy way to annotate not-so-perfect circles for which the circle annotation wouldn’t work. Still much faster than drawing a polygon around each shape

Cons: ellipses have to be adjusted in direction and width, which might take some time, and they would not be applicable if objects are not exactly elliptical

Rating: ⭐️⭐️

8. Cuboid

Last but not least, what if we want to label objects in 3D space and know their rotation or closeness? We can draw a cuboid starting from the farther side of the object and extending it to the side closer to us, like in the case of these 🥞 pancakes, 🏺 pitcher and fork 🍴. The output is a JSON file with the coordinates of the first bounding box (farther side) and the second bounding box (closer side).
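As a hedged sketch, the two faces could be serialized as two ordinary boxes (field names are illustrative, not any particular tool's JSON schema):

```python
# A sketch of a cuboid annotation as two 2-D rectangles: the farther face and the
# face closer to the camera. Field names and values are illustrative assumptions.
pitcher_cuboid = {
    "label": "pitcher",
    "far_face":  {"x_min": 310, "y_min": 90,  "x_max": 380, "y_max": 200},
    "near_face": {"x_min": 290, "y_min": 110, "x_max": 372, "y_max": 235},
}
```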

Pros: this is a great way to annotate objects in space, especially relevant for visual imagery that is coupled with LiDAR or other spatial data

Cons: may be a bit more difficult to draw, especially if objects have irregular forms or are occluded or truncated. Also, not many tools support cuboid annotation, so you have to choose an appropriate one (maybe check out our annotation tools review?)

Rating: ⭐️⭐️⭐️⭐️

9. Semantic segmentation

Semantic segmentation is a type of pixel-wise annotation where every pixel in the image is assigned to a semantic class. In the above case, the classes are: “carbohydrate” 🍚, “protein” 🍖, and “vegetable” 🥬, in addition to a “background” class which covers all other pixels. This could work great for visually inspecting the nutritional composition of meals. The output is most often a PNG mask with the colors of each class. However, it could also be a JSON file with bitmap objects as base64 encoded strings.
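A minimal sketch of what such a pixel-wise mask boils down to, using a tiny made-up 4×4 image with arbitrary class IDs and colors:

```python
import numpy as np

# A minimal sketch of a semantic-segmentation mask: one class ID per pixel.
# The 4x4 "image", class IDs and colors are all made up for illustration.
CLASSES = {0: "background", 1: "carbohydrate", 2: "protein", 3: "vegetable"}

mask = np.array([
    [0, 0, 1, 1],
    [0, 2, 2, 1],
    [3, 3, 2, 0],
    [3, 3, 0, 0],
], dtype=np.uint8)

# To export the mask as a colored PNG, each class ID is mapped to an RGB color.
palette = {0: (0, 0, 0), 1: (255, 220, 0), 2: (200, 30, 30), 3: (40, 160, 60)}
rgb = np.zeros(mask.shape + (3,), dtype=np.uint8)
for class_id, color in palette.items():
    rgb[mask == class_id] = color
```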

Pros: ultra-precise, since every pixel must be assigned to one class. It focuses on interpreting the pixels in the image, which makes it essentially different from vector annotation (examples 1–8), which is focused on object detection.

Cons: takes a lot of work to segment the image. In the above case, the scene is segmented by hand using a brush and an eraser, which can be quite a hassle. Fortunately, there are quite a lot of tools that support superpixel annotation, which splits the image into larger tiles based on edge detection with adjustable granularity, so the user only needs to color the superpixels as needed. It could also be done using polygons.

Rating: ⭐️⭐️⭐️⭐️

10. Instance segmentation

Instance segmentation is similar to the previous case but solves one challenge: what if we want to segment each item of food separately and know how many there are, while still knowing what class they belong to? So for these pumpkins 🎃 we not only segment them and label them as class “pumpkin” but mark them as separate instances of the class “pumpkin”, so we know which one is number 1, number 2 and so on. The output is again a PNG mask or a JSON file with coordinates/bitmap.
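Continuing the sketch from the previous section, the difference is one extra layer of information: an instance ID next to the class ID (tiny made-up arrays for illustration):

```python
import numpy as np

# A sketch of instance segmentation: every pixel gets a class ID and an instance ID,
# so two touching pumpkins remain separate objects. All values are illustrative.
class_mask = np.array([
    [1, 1, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 1, 1],
], dtype=np.uint8)          # 1 = "pumpkin", 0 = "background"

instance_mask = np.array([
    [1, 1, 0, 2],
    [1, 1, 2, 2],
    [0, 1, 2, 2],
], dtype=np.uint8)          # 1 = pumpkin number 1, 2 = pumpkin number 2, 0 = none

pumpkin_count = len(np.unique(instance_mask[instance_mask > 0]))  # -> 2
```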

Pros: this is the most precise and accurate annotation type which allows us to know how many instances of each class there are in the image

Cons: requires even more work than semantic segmentation. For example, if the pumpkins overlap, in the semantic segmentation method we would mark them all together as one single mass of the class “pumpkin”, while in instance segmentation we still have to draw each one of them individually

Rating: ⭐️⭐️⭐️⭐️

Interested in having your dataset annotated with any of these annotation types?
Get in touch with our team at Humans in the Loop and a project manager will advise you on the best strategy!

