Sense and Sensibility: ML Note Detector
For this project we are asked to investigate the capabilities of mainstream ML and generative AI resources we can access from the web. We've taken a look at Google's Teachable Machine (which I'll experiment with), Runway, and other generative frameworks such as Stable Diffusion and Midjourney. We also learned how to bring a model trained in Teachable Machine into P5.js and how to use ML5.js, both of which proved to be surprisingly approachable from a coding point of view.
One important thing we are asked to do is reflect on the practical, philosophical, and ethical implications of artificial intelligence. A word I really love came up too: umwelt, describing the sensory array an organism uses to perceive and make sense of the world around it (or around them, if we're talking about computers).
Now, umwelt is a concept I also explored in last year's Design Domain project (aptly named "SENSO", meaning "sense"). In a few words, I used a multitude of sensors to investigate how a computer manages to read a human body, and how such readings can be used, or translated, to convey a point of view: the machine's. I'm very glad this concept comes up here as well, because with machine learning the capabilities broaden and, in a way, make the concept itself more "fluid and dynamic".
When I was working on SENSO (or, to be fair, on any concept that involves hooking a sensor up to my laptop) I kept thinking about why I'm interested in working with computers at all. I'm going on a bit of a tangent here, but basically I use computation as my medium because I'm interested in approaching a concept in the most "inhuman" way possible, detaching my persona from the artwork as much as I can. This mainly comes from an interest in science, biology, chemistry and so on (what I just call nature for simplicity), and from a firm belief that to reach any level of unbiasedness I can only rely on numbers. Of course, numbers are a purely human invention, and paradoxically I recognise that it's impossible to fully detach from human bias, but 1) numbers are the closest thing we have to pure nature, and 2) numbers are the only truly universal method people have for approaching the world we live in. So, in a way, what I want to do in any of my artwork is trust the numbers for what they are and what they give: the least bad option among universally comprehensible (for humans only, which is why it's only the least bad) ways of seeing the world.

And what can process numbers better than computation? Even the act of coding requires a de-humanisation of sorts. We have to re-learn how to speak in order to speak to a machine, altering our language to make our intentions comprehensible to a computer, and optimising it so that it can also run on differently capable machines. Coding, to me, is about speaking through first principles so that every compiler can understand what we're saying, and in my opinion that's how nature works too. Modularity isn't just cool because it gives us infinite possibilities out of those first principles; it's beautiful and inspiring because it's simply how the universe works: DNA is an example of modularity, atoms are an even stronger one...
AI kind of takes all of this and throws it out of the window: it personifies the machine, and that can lead to incredibly beautiful and painful results alike. The first principles disappear, which, yes, simplifies how we talk to the machine, but also enormously complicates how we can understand it. Don't get me wrong, I have a neutral position on ML and AI in general, but I tend to separate AI from the computation I described earlier; to me it's a different category. To put it simply, if I see computation as a derivative of nature, I see AI as a branching-off from the human perspective (of course humans, like anything else, derive from nature too, but I hope you get what I mean). That's why we encounter issues of very humanly made bias such as racism, sexism and classism: unnatural biases born only from grey matter with too much free time on its hands. Okay, if you read some neurobiology you'll see that such biases are present in many animal species as well, but they have very different origin points and reasons to exist; let's be honest, there's a difference between hyenas' evolutionarily generated behaviour and humans' socially constructed behaviour.

All I'm trying to say is that the computer as we've always known it is not simply well-arranged electronics; it's a mechanism inspired by nature, one that developed long before the first ever calculator, and long before the first ever Jacquard loom. I know it sounds cheesy as hell, but to me a computer is simply a way of thinking. AI and machine learning, on the other hand, are designed by humans in an image inspired by humans themselves: meant to think like a human, speak like a human and work like a human, for humans. Everything I've heard so far about AI can be summed up with the sentence "it's useful for (something)", but I'm a firm believer that it will never be "necessary for (something)".
That said, I'm excited to start playing around with some machine learning. I'm VERY reluctant to play around with image generation, simply because I've decided to boycott that capability given why it has been developed so heavily (mainly to get rid of artists' fees), although I can see how it might be used to create placeholder images for concept work.
After our tutorial on Google's Teachable Machine I decided to work on something that could be useful for me: a note-detection model. I'm a stage designer, motion graphics designer and lighting designer, and one fundamental part of my work is following the music at an event and generating motion graphics that interact with the audio via audioreactivity. Note detection is a "very extra" feature in my work, given that I already have some useful systems for distinguishing low, mid and high frequency ranges and for detecting snares, kicks and hi-hats, but sometimes note detection can be very useful.

For example, I work a lot with jazz bands. Unlike more conventional club music (mostly EDM), a jazz band produces a huge spread of frequencies: the frequencies in a kick from an 808 are far more controlled than the ones coming from a kick on a traditional drum kit, because the kit also outputs a lot of uncontrolled frequencies, such as the cymbals vibrating along with that kick, or the whole metallic body of the kit resonating with it. The same applies to any traditional instrument: the timbre you get is not only a piano key being pressed and its string vibrating, it's the whole body of the instrument vibrating with that key. So, back to a jazz performance: all I can distinguish from a band playing is the lows, the mids and the highs. Then suddenly a sax solo arrives, and what do I get? Mostly highs, some mids, maybe some lows. On top of that I get the volume of those three channels, but that's all I have to work with, and if I'm aiming for tightness and cohesiveness of visual elements (be it lights or motion graphics on a projector or LED panel) that's an unsatisfactory amount of data.

This is why note detection can be useful: I can have elements triggering for every single note being played, not just for the overall volume. What I'm hoping for is that an ML model could skim through all of the muddiness generated by a traditional instrument and give a clear output for each single note. Normally, if I want note detection I'll just use Ableton with some Max device, but that's always clunky to set up and it's never really perfect. Later we'll see that ML is actually worse than that, but let's take a look at it.
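For context, this is roughly the kind of band-splitting I already rely on, written here as a minimal p5.js sketch (assuming the p5.sound library is loaded; it's just an illustration of the approach, not code from an actual show rig):

```javascript
// Minimal p5.js sketch: split live audio into low / mid / high bands
// and use each band's energy to drive a visual element.
// Assumes p5.sound is loaded alongside p5.js.

let mic, fft;

function setup() {
  createCanvas(400, 200);
  mic = new p5.AudioIn();   // live input, e.g. the band's mix
  mic.start();
  fft = new p5.FFT();
  fft.setInput(mic);
}

function mousePressed() {
  userStartAudio();         // browsers need a user gesture before audio starts
}

function draw() {
  background(0);
  fft.analyze();

  // p5.FFT exposes preset frequency ranges: "bass", "mid", "treble"
  const lows  = fft.getEnergy("bass");    // 0-255
  const mids  = fft.getEnergy("mid");
  const highs = fft.getEnergy("treble");

  // Three circles whose size follows the energy of each band --
  // the kind of volume-driven triggering described above.
  noStroke();
  fill(255);
  circle(100, height / 2, lows);
  circle(200, height / 2, mids);
  circle(300, height / 2, highs);
}
```

That's three numbers per frame and nothing else, which is exactly the limitation I'm hoping note detection can get around.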
Teachable Machine can be trained on sound data, which is pretty cool. The problem is that I can't upload sounds I exported myself: for some reason they need to be recorded in real time, or imported from a previous Teachable Machine export. Anyway, I decided to start with something that should be fairly simple to recognise: the sound of Ableton's Operator, a simple sine wave that ought to be easy for the model to train on. Using my laptop's microphone (I was away from my flat during this training, so I had to make do with what I had) I recorded about 16 samples for each note. I decided to focus on just one full scale, taking in the natural notes and their semitones as well.
Here's a wee video showing the model in action. I'm recording this video a while after training, so the background noise might be a bit different from what the model is used to. Although it gets confused quite easily, I think this shows that it might actually work with A LOT more training.
On the P5.js side of things, getting the model running was simple. I only had to paste the model's URL into the example sketch we were given in class and modify it a bit to show which note was actually playing: if the confidence for a note is above 60%, the sketch writes the corresponding note on the screen.
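I don't have the exact file to hand, but the logic boiled down to something like this minimal sketch (the model URL is a placeholder, and I'm assuming the ml5.js soundClassifier with the older (error, results) callback style we used in class):

```javascript
// Rough reconstruction of the sketch: an ml5.js sound classifier fed by a
// Teachable Machine audio model, printing the detected note when the
// model is confident enough. The model URL below is a placeholder.
const modelURL = "https://teachablemachine.withgoogle.com/models/XXXXXXX/model.json";

let classifier;
let currentNote = "";

function preload() {
  classifier = ml5.soundClassifier(modelURL);
}

function setup() {
  createCanvas(400, 200);
  textAlign(CENTER, CENTER);
  textSize(64);
  // Start listening to the microphone and classifying continuously.
  classifier.classify(gotResult);
}

function draw() {
  background(0);
  fill(255);
  text(currentNote, width / 2, height / 2);
}

function gotResult(error, results) {
  if (error) {
    console.error(error);
    return;
  }
  // results[0] is the most confident label; only show it above 60%.
  if (results[0].confidence > 0.6) {
    currentNote = results[0].label;
  }
}
```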
After this little experiment I concluded that there isn't much that can be done with this technology, mainly because Google offers a pretty limited set of things that can be done and data that can be exploited. I feel like these frameworks are "uni-taskers", in the sense that they do one thing and one thing only. All I can do with the output is "if this is 0 do this, if this is 1 do that", and that felt pretty limiting... It's a shame that we're not allowed to work on more complex kinds of machine learning through Runway, because I think that would have been an interesting way of achieving more varied things and getting more varied outputs.
MediaPipe in TouchDesigner
With a bit more research into finding something more versatile to work with, I ran into a very interesting plugin built for TouchDesigner. The plugin was developed by Torin Blankensmith and Dom Scott, and they did an amazing job porting the MediaPipe framework into TD. MediaPipe is a cross-platform pipeline framework for building custom machine learning solutions for live and streaming media, and because it's written in C++ it can be ported to lots of different operating systems and programming environments.
The plugin works pretty much out of the box and can be downloaded from here (just make sure to download it from the Releases section on the right). When we run the project containing the plugin, we are greeted with a container named "MediaPipe" that already displays what it's capable of. The framework allows us to do hand tracking, face tracking, pose detection and object detection. Alongside all these features, in TD we can also see all of the data being tracked, giving us numerical values for things like eye blinking, eyebrow position, mouth angles, hand velocity, finger position and so much more.
I'm going to make use of the face-tracking data to create an interactive face filter. Basically, what I want to make is a toy filter that changes the imagery on screen according to my expression of emotion. Emotion detection would be a good way to do this, but I couldn't find a free method for it online, so I'll rely on the face-tracking values given by this plugin to approximate it, defining how values for the eyebrows, eyes, mouth and cheeks can add up to some level of emotion. I will also use this data to change the sound of my voice through Ableton, for example by pitching it down when angry or up when happy, or even adding special sounds to describe my state of mind. Along with the filter I want to develop a stylish HUD that shows how certain things get triggered and how my face's emotional state is perceived.
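To make the idea concrete, here is a rough sketch of the kind of hand-made mapping I have in mind, written in plain JavaScript since that's what we've been coding in so far (in TouchDesigner it will live in CHOPs or a Script operator instead). The input names are made up; they just stand in for the normalised channels the plugin exposes:

```javascript
// Hand-made "emotion approximation" from face-tracking values.
// The input fields are hypothetical stand-ins for the normalised
// channels the MediaPipe plugin exposes (all assumed to be 0..1).
function approximateEmotion(face) {
  const { browRaise, browFurrow, mouthSmile, mouthOpen, eyeOpen } = face;

  // Very rough heuristics: raised brows + smiling mouth reads as happy,
  // furrowed brows + tight mouth reads as angry, wide eyes + open mouth
  // reads as surprised. Everything else falls back to neutral.
  const scores = {
    happy:     mouthSmile * 0.7 + browRaise * 0.3,
    angry:     browFurrow * 0.6 + (1 - mouthSmile) * 0.4,
    surprised: mouthOpen * 0.5 + eyeOpen * 0.3 + browRaise * 0.2,
  };

  // Pick the strongest emotion, but only if it clears a threshold,
  // otherwise report "neutral" (same idea as the 60% rule earlier).
  let best = "neutral";
  let bestScore = 0.5;
  for (const [emotion, score] of Object.entries(scores)) {
    if (score > bestScore) {
      best = emotion;
      bestScore = score;
    }
  }
  return { emotion: best, scores };
}

// Example: raised brows and a big smile should read as "happy".
console.log(approximateEmotion({
  browRaise: 0.8, browFurrow: 0.1,
  mouthSmile: 0.9, mouthOpen: 0.2, eyeOpen: 0.6,
}));
```

The threshold is there for the same reason as the 60% rule in the note detector: when nothing clears it, the filter falls back to neutral instead of flickering between emotions, and the same scores can then be routed on to the visuals and to the voice processing in Ableton.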