Applications of Delaunay Triangulation and Edge Flipping in Facial Recognition and Detection

(subtitle: Make Yourself Cubist™)

More formally:
here.

Note: This page exists as an archived version. The finalized project is available at the link at the top of the page and here.

Open Snapchat to take a selfie. Fix your hair, tilt your chin just so, perhaps parlay a smize and — No no, pause. You realize now that this is the perfect moment. This is the moment where you take a selfie such that you become another iteration of that canonical dog-eared millennial archetype.

Okay, reposition, prep that puppy pout and tap the screen. And then, there, in that brief moment in the interim, between pressing the screen and when a cascade of filter options appear, it appears to you, like a flock of magical realist butterflies in Macondo: a triangulation.

At first glance, it is only a hallucinatory splice of memory from a Computational Geometry lecture, long ago. But no, it is truly there: the revelation of the structure that is the topology of your face.

There is a brief moment in the interim, between pressing the screen and when a cascade of filter options appear, when the Snapchat screen makes itself vulnerable — or seemingly so. Perhaps the algorithm (point location, triangulation, alpha mask generation, etc.) is simply not fast enough to go unnoticed. Or perhaps, in hinting a glimpse at your triangular decomposition, a user feels that tantalizing endorphin-surging rush of peeking behind the curtain as the Wizard soliloquizes about the mysteries of the “cloud."

Or perhaps, like most users, you never noticed the triangular mesh on your hurried way to transfigure yourself into some canonical dog-eared millennial archetype. That’s about to change.

A triangulation is a maximum planar subdivision - a breaking of a polygon or a point set into triangles. Segmentation of a larger space into manageable pieces allows for a myriad of complex operations — point location (i.e. google maps), an optimized distribution of vantage points (i.e. art gallery guarding), or, in this case, the creation of a mesh of a face for computer graphics rendering.

As with all divisions of space, there exists a multiplicity of means to deconstruct a face. This project was commenced from an initial search for an optimal facial triangulation.

What does it mean to have an optimal facial triangulation? In the context of this project, it is a fast yet reliant algorithm to render an accurate mapping of facial landmarks (eyes, ears, mouth and nose) for later processing. We can think of it this way — in order to allow for persistance of a projected filter onto a face, the algorithm, for detection of landmarks and their subsequent division via relationship-denoting edges, must return a triangulation that is rooted in the variable nature of a facial construction.

A facial triangulation algorithm must detect facial landmarks and then triangulate that representative point set such that graphics mappings can be persistently implemented, irrespective of shifts in light, direction or background noise in the camera image capture.

This project seeks to flesh out what is known as algorithmic transparency — herein defined as a clear and meticulous dissection of the numerous steps of an algorithm, for the goal of elucidating the structure and decisions of a perhaps-opaque-seeming technology. In short, to allow a user to peek behind the wizard’s curtain.

Why?

We delineate spaces without thinking. This is a way to think about how we create referential landmarks to ourselves and to other points. This is an exploration of space, divisions of space and the landmarks we forget to notice on the landscapes that we most frequently apotheosize and excoriate concurrently — our own faces.

A copy of the more formal research paper that resulted from this project is available in PDF format via the link at the top of this page

Pixels

Understanding the process of triangulating a selected face region (i.e. what we need so we can put a cutesy infantilizing dog filter on it!) requires understanding how a computer understands a photograph.

A pixel - unit of measurement - a square on your screen - the number of which per inch (the correct usage of “ppi” not “dpi” will earn you bonus points with the 𝜋 people who care and put you ~3𝜎 away from the fickle favors of the general population, but now you know) can reveal more and more ugly details of any given image.

A single pixel contains a wealth of knowledge that is both bountiful and generally rather accessible; there are no offshore Cayman Islands accounts for pixel info — not yet, anyways.

Image pixels are stored as one-dimensional arrays. While we view an image as having two dimensions (Euclidean x,y coordinates or “height” and “width”), the pixels are stored in one contiguous strip. A pixel at location (x,y) in a picture is indexed in the pixel array at [x+ y*width].

2. Convolution

Take a picture. Put another over it. This is like every problematic makeover scene in every teen movie that defined my early years before I found Hal Ashby and Agnès Varda: whatever veneer you choose to smear over the face of the original image will make it unequivocally better than its previous iteration. Did you know that self-hate is a native data type in a JPEG?   

A convolution is like putting a piece of stained glass over a window pane — it does not technically alter the original data of the image, but it does alter our view — we “convolve” an image with a convolution matrix, by passing a small amount of window paint / across the / window pane. (Sorry, Sylvia, not my best.)

Once you’ve painted the window with this convolution matrix, the image will look different. A swipe of lipstick and sad.jpeg becomes missCongeniality.jpeg to our program’s eyes.  

Ceci n’est pas une fenêtre? Remember, we’re not looking at something real — we’re looking at a representation of something real. It’s not really your face on the screen, it’s a representation of your face on the screen that you have been fooled to believe is truth; your face is actually just malleable numbers organized into a straight line.

Take, for example, a Gaussian filter.

What?

Essentially a reduction in noise. More formally, setting value of pixel to a value that is the normalized, weighted average of the pixels in a set neighbourhood around that pixel. We are making the picture less crisp, less defined, less interesting — perhaps an Orwellian convolution of a pixel via a given matrix such that a flatter and more complacent range of values correlating to thinkspeak results.

Gaussian filtering opens the doors for other operations — Sobel Edge detection, corner detection, and greater accuracy for finding facial landmark points.

See Resources:

Gaussian Smoothing
Histogram of Oriented Gradients
Image Recognition and Object Detection with OpenCV
Face Recognition using Histograms of Oriented Gradients
Image Kernels

3. Face Detection

Face detection in an image essentially works in two parts here: (I) detection of the “blob,” or “contour” or “shape” of what is recognized as a face; (II) detection of the landmark points that make up a face — i.e. eyes, nose, mouth, jawline.

For ease and simplicity, I have used the OpenCV module for Face Detection to create a bounding box from which to work for the remainder of this visualization.

4. Point Location

To be able to generate filters, image manipulations on our bounding-box-designated face region, we will have to first locate the "landmark points" that make up a face. Landmark points can be understood as a very basic division of the face into eyes, ears, mouth and nose, but distilled to a few key coordinates for each. In a sense, a rather emoji-like rendering of your face is distilled from all of the information. Isn't it nice to know that you can be reduced to a post-human version of yourself faster than you can say ex-Machina?

Let's understand how to do that.

Facial Landmarks

We want to determine “facial landmarks.”  There are several different libraries used commonly for Machine Learning related to facial detection/ recognition which rely on libraries of hand-drawn facial landmarks on “generic” faces (this is perhaps playing into standards of Euro-centric standards of Western beauty and “accepted” facial structures — an important point to note that I will not delve into here but which can be explored here and here and more in-depth here ).

These libraries (i.e. iBUG etc) use these input facial “maps” as a guide — once interest points are detected, the generic 68-point mapping of facial landmarks can be used to refine and adjust the location of the landmarks on an input image.

Euclidean distances between detected interest points and the location of landmarks on inputted maps can verify or discard interest points until an approximate set of 68 points (or more or less, depending on the library) is determined.

For this visualization, I have chosen to forgo utilizing ML technologies and external libraries of facial landmarks for simplicity’s sake. In the narcissistic spirit of selfie-culture, I have used a singular image of my own face as a baseline.

Using reference images from the 68-landmark-point model, I approximated this path on my face. This process strangely parallels the source material used in databases for these libraries -- I am using hand-drawn approximations to use as a comparative model. This is a rather naive, perhaps brute-force way of approximating facial features, but it accomplishes the task needed for this visualization — which is simply, to make your face “Cubist” via a triangulation mesh.

This base image -- selfie with self-approximated landmark points -- was then condensed further.

b) Use convolution matrix of Gaussian filter to “smooth” image; poetically, this is the blurring of lines that separate our bodies from the space that surrounds them; algorithmically, this is a way to quantize pixel brightness values to refine the pool of potential landmarks while still retaining the majority of original “edges” in an image.

c) Use Sobel edge-detection algorithm on smoothed image. This will detect "edges" (places where there is a large brightness shift between adjacent pixels (pixels that are located either vertically or horizontally next to one another but NOT those pixels that are located next to each other on a diagonal).

d) Use Shi-Tomasi Corner Detection - using the stack of detected edge pixels, we can use this smaller data set to determine “corners” — which we will then then refine to interest points.

Shi-Tomasi Corner Detection Algorithm

Using a 3x3 matrix of input pixels, we calculate the average difference in pixel brightness to find areas — to be further refined to points — of high contrast.  Why?

Contrary to the Photoshop trigger-happy poreless finish of the advertising world, our faces (all surfaces) are made up of planes which, courtesy of undulating concavities and convexities, reveal dimensionality — also known as what the hair-care world has co-opted as “volume.”

Find these areas where light meets deep shadow, and you have located facial landmarks, which we can use for a myriad of recreational purposes like mapping filters of mildly fetishized forest creatures or for creating an NSA database for quick and painless identification. Endless possibilities.

Let’s make some triangles for now.

6. Convex Hull

So we have points. That’s nice. Remember when your 2nd grade art teacher critiqued your family portrait because you drew the faces askew and the dots for the eyes melting into the bottom of the chin and you told her that she a soul-eviscerating robot for disregarding the merit of your work because André Breton would really enjoy your embrace of a visual perspective that veered away from conventional aesthetics and more towards a way of seeing that embraced the ideal of rêver en éveillant? (I wouldn’t know, I skipped second grade but I felt that this would be a good time for you to unload some emotional baggage. That’s really the entire point of AI isn’t it? (See Her)).

Yeah, that’s not this moment. Hopefully, this program didn’t detect the points for the eyes as being off the bounding box and the nose has a couple of points to be classified as a blob (not the horror flick. Also, a real term. I know, I was also surprised. Science is astounding, sometimes.)

Either way, you’ve got face points. Now we’re drawing a boundary for our points.

But WAIT. Aren’t those points already within a bounding box (you shout at the sky, tears streaming down your face)? Isn’t this a frame within a frame already? When does the frame end and the picture begin? What about a post-structural space of viewing that doesn’t align itself within conventionality in the restrictive setting of a gallery which itself plays a part in institutional systems of oppression?

Astoundingly, a convex hull, while a frame within a frame (meta, wow)), is also a beautiful way of demarcating space. We want the smallest boundary. Almost as though we want the boundary that can hug all of its points while still remaining aloof (do you get the portrait now, Ms. Richards?! Symbolism is not something I invented when I was seven, it’s a really poignant device for explaining relational aesthetics without implicating an explicit narrative in a work.)

7. Triangulation

Sets of three are not fun if you want a college romance but they are nice when you are looking for ways to group points sets in a way that will translate to an even partitioning of space.

David Mount: Lecture Notes on Triangulating Monotone Polygons

For a deeper understanding of the foundations for the section, please see this paper by Chazelle on linear-time decompositions of simple polygons. Triangulating a Polygon in Linear Time

8. Delaunay Triangulation

It is called the “most beautiful triangulation” amazingly not because it buys into the 90s heroin chic aesthetic of emaciation = beauty, but because, rather than preferring the thinnest triangles, this triangulation prefers an organization of triangles which maximize the minimum angle in each triangle.

9. Voronoi

Voronoi is a tie-breaker algorithm — it determines the delineations between points such that the dividing line between each two input points is equidistant from those two points. A medial axis, if you will.

There are some nice real-world understandings of this:  The range of a group of cell towers in a given city. Sad cubicles in a generic, soul-eviscerating job that functions as a cog in (*insert your favorite bureaucratic corporate conglomerate here*), meant to keep each person as close and as far from each other as possible at all times. This Creative Commons image of cells also illustrates a Voronoi diagram in nature — of which there are a plethora.

Voronoi diagrams can perhaps feel isolating but they can also be protective — as in determining the optimal spacing of hexagonal plates on a tortoise shell so that the convex structure can maximize interior room (understanding that this is also looking at a 3D representation of a Voronoi diagram but, for poetic license, just allow me this example).

Voronoi diagrams map distances from neighbors. Intuitively it follows that a Voronoi diagram (the dual of the Delaunay Triangulation) stems from the Minimum Spanning Tree and Nearest Neighbor Graphs of a given point set.

Like a palimpsest, every layer of graph builds upon one another, each subsequent layer containing more edges, explaining different relationships.

NNG ( MST ( RNG ( GG ( DT

References:

Academic Papers

R. Herpers and G. Sommer “An Attentive Processing Strategy for the Analysis of Facial Features”
Peiyun Hu, Deva Ramanan Robotics Institute Carnegie Mellon University “Finding Tiny Faces”
Dani Lischinski “Incremental Delaunay Triangulation”
Hervé Jégou Inria, Rennes, & Andrew Zisserman, Dept. of Engineering Science, University of Oxford “Triangulation embedding and democratic aggregation for image search”
Marc Alexa et al. “Computing and Rendering Point Set Surfaces”
Faniry Harijaona Razafindrazaka African Institute for Mathematical Sciences (AIMS) "Delaunay Triangulation Algorithm and Application to Terrain Generation"

General Resources:

In Defense of Nearest-Neighbor Classification