Spectrum analyser in Music does nothing August 14, 2006 9:16 PM

Is there a reason for a spectrum analyzer on the player that clearly does absolutely nothing related to the music playing? I've noticed this on a few other flash mp3 players, including Myspace's. Is there a way to make it not there?
posted by potch to MetaFilter Music at 9:16 PM (41 comments total) 6 users marked this as a favorite

I know it's colossally anal to complain about this, but it just seems to have no function whatsoever, and increases the visual footprint of the player.
posted by potch at 9:17 PM on August 14, 2006


I don't see a spectrum analyzer. Am I missing something?

And the Myspace player's "spectrum analyzer" is just a stupid Flash animation, I think.
posted by joshuaconner at 9:50 PM on August 14, 2006


It's in the playlist player.
posted by potch at 9:53 PM on August 14, 2006


Good God, that is annoying as hell.... now that someone has pointed it out.
posted by crazyray at 9:58 PM on August 14, 2006


Why do you hate deaf people?
posted by Galvatron at 10:05 PM on August 14, 2006


MP3 and other modern forms of sound compression don't operate at the level of wave forms. You can't compress the sound enough when you're trying to do that.

Instead, the encoding process takes a certain chunk of the sound, the wave form for a few milliseconds, and then performs a Fourier analysis of it. What gets stored is more or less the coefficients of the Fourier function, after low amplitude coefficients and phase relationships are discarded.

To play the sound, the coefficients are then used to reproduce the waveform. (Though since the original phase relationships are not preserved, the resulting wave form if displayed on an oscilloscope will not usually bear any resemblance to the original.)

The "spectrum analyzer" you're seeing is really just a barchart display of some of the coefficients being taken from the MP3 as it's being played. The reason they do it is because they've already got it, and it's easy to display it and the display looks kind of cool. The compute cost is negligible.
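A rough sketch of the idea described above, in Python. It is a toy under stated assumptions, not how a real MP3 encoder works: real encoders use a filter bank, a modified discrete cosine transform, and a psychoacoustic model, while this uses a plain DFT on one 64-sample chunk. The function names, the threshold, and the test tones are all made up for illustration.

```python
import cmath
import math

def dft_magnitudes(frame):
    """Return the magnitude of each DFT coefficient up to the Nyquist bin."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(acc) / n)  # taking abs() throws the phase away
    return mags

def drop_small(mags, threshold):
    """Zero out coefficients below the threshold (the low amplitude ones)."""
    return [m if m >= threshold else 0.0 for m in mags]

def bars(mags, width=40):
    """Render the surviving coefficients as a text bar chart."""
    peak = max(mags) or 1.0
    return ["#" * round(width * m / peak) for m in mags]

# Hypothetical chunk: a strong tone in bin 4 plus a weak tone in bin 10.
N = 64
frame = [math.sin(2 * math.pi * 4 * t / N) + 0.05 * math.sin(2 * math.pi * 10 * t / N)
         for t in range(N)]

kept = drop_small(dft_magnitudes(frame), threshold=0.1)
for k, bar in enumerate(bars(kept)):
    if bar:
        print(f"bin {k:2d} {bar}")
```

The weak tone falls below the threshold and is dropped, so only one bar is drawn. The bar chart comes straight from numbers the decoder would already have, which is the point being made about the display being nearly free.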

(The codecs used in cell phones are even more elaborate than what I just described, I might mention. I used to work with some codec designers, and they're black magicians. Cell phone codecs are optimized for human voices and get preposterously good compression as a result.)
posted by Steven C. Den Beste at 10:24 PM on August 14, 2006 [15 favorites]


^^^Holy shit^^^
posted by evariste at 10:27 PM on August 14, 2006


I wonder what relationship VOIP codecs have to cellphone codecs.
posted by evariste at 10:29 PM on August 14, 2006


By the way, that's why it's possible to play back MP3's faster or slower without the frequency rising or falling. The playback treats the Fourier coefficients as describing a shorter or longer period of time than originally was the case. It still creates the same frequencies -- for longer or shorter durations.
posted by Steven C. Den Beste at 10:31 PM on August 14, 2006


VOIP codecs don't have the same bandwidth constraints as cell phone codecs. Most modern voice calls are limited to 8000 bits (1000 bytes) per second, which really isn't very much. (I believe that 160 kilobits (20 kilobytes) per second is considered pretty normal for MP3's, isn't it?)

The Enhanced Variable Rate Codec (EVRC), which is the current state of the art for IS-95 and CDMA2K, usually uses much less than 1 kilobyte per second. Depending on the sound, it can be as little as 1000 bits (125 bytes) per second. (The sound is divided into 20 millisecond frames, and for each frame the "variable rate" codec decides to create either a full-rate, half-rate, quarter-rate, or eighth-rate packet, depending on how complex the sound was during that 20 millisecond period.)
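The arithmetic behind those figures can be checked with a few lines of Python. This is a sketch that takes the numbers quoted above (8000 bits per second at full rate, 20 millisecond frames) and assumes each smaller packet is exactly one half, one quarter, or one eighth of a full-rate packet; the real EVRC packet sizes may not divide that evenly.

```python
FRAME_MS = 20
FRAMES_PER_SECOND = 1000 // FRAME_MS  # 50 frames per second

FULL_RATE_BPS = 8000  # figure quoted in the comment above
FULL_FRAME_BITS = FULL_RATE_BPS // FRAMES_PER_SECOND  # bits in one full-rate frame

# Assumption: packet sizes scale exactly with the rate name.
DIVISORS = {"full": 1, "half": 2, "quarter": 4, "eighth": 8}

def bits_per_second(rate_name):
    """Bit rate if every frame in a second used the given packet size."""
    return (FULL_FRAME_BITS // DIVISORS[rate_name]) * FRAMES_PER_SECOND

for name in DIVISORS:
    bps = bits_per_second(name)
    print(f"{name:8s} {FULL_FRAME_BITS // DIVISORS[name]:4d} bits/frame "
          f"{bps:5d} bits/s {bps // 8:5d} bytes/s")
```

Under that assumption a stream made entirely of eighth-rate packets comes to 1000 bits (125 bytes) per second, which matches the lower figure quoted.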

VOIP operates over a fat pipe, so compression isn't as critical. I don't know specifically what they do use but I'm sure it isn't remotely as elaborate as what cell phones use. The compute cost for that kind of compression is too high, and all those algorithms are patented five ways from Friday.
posted by Steven C. Den Beste at 10:40 PM on August 14, 2006 [2 favorites]


SCDB, potch is referring to the MetaFilter Music flash player that appears in the playlist pages. It has a useless "frequency spectrum" window that appears to be randomly generated. Providing a representation of DCT coefficients would be a huge improvement.
posted by Galvatron at 10:42 PM on August 14, 2006


Why it has nothing to do with the audio: Flash does not have the capability to analyze the spectrum of a playing MP3. The only way to have an accurate animation is to analyze the sound file a priori with third-party software which generates the animation (or at least some values you can use for your animation). You then sync this animation with the sound playing.

You can do this on the server before serving the sound file, and it can get expensive.
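A minimal Python sketch of that precompute-then-sync approach, assuming the server already has decoded samples in memory. It stores one loudness value (RMS) per animation frame rather than a full spectrum, and the function names, frame rate, and test signal are hypothetical.

```python
import math

def precompute_levels(samples, sample_rate, fps):
    """Analyze the whole file up front: one RMS level per animation frame."""
    per_frame = sample_rate // fps
    levels = []
    for start in range(0, len(samples), per_frame):
        chunk = samples[start:start + per_frame]
        levels.append(math.sqrt(sum(s * s for s in chunk) / len(chunk)))
    return levels

def level_at(levels, playhead_seconds, fps):
    """At playback time, look up the stored value for the current playhead."""
    index = min(int(playhead_seconds * fps), len(levels) - 1)
    return levels[index]

SAMPLE_RATE = 8000
FPS = 10

# Hypothetical audio: one second of silence, then one second of a 440 Hz tone.
samples = [0.0] * SAMPLE_RATE + [
    math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(SAMPLE_RATE)
]

levels = precompute_levels(samples, SAMPLE_RATE, FPS)
print(level_at(levels, 0.5, FPS))  # during the silence
print(level_at(levels, 1.5, FPS))  # during the tone
```

The player only needs the playhead position and the precomputed list, which fits what Flash is said to expose here. The cost is that the list has to be generated and shipped for every file.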

Why it is still useful: My job is to develop rich media flash applications. Every time we build an audio player, users demand a spectrum analyzer. We have found out that what really satisfies them is the coolest possible animation that lets them know that audio is supposed to be coming from the app. Ever tried to find out where the annoying music is coming from when you have 15 open tabs on Firefox?

I guess the 'spectrum analyzer' format is common enough that most people instantly associate it with sound.

BTW, I just cooked up a fake analyzer, with five bars. It takes 1 graphic symbol and less than ten lines of actionscript. It increases application size less than 1k.
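For a sense of how little such a fake analyzer needs, here is a Python sketch of the same idea (not the ActionScript mentioned above, which is not shown in this thread). It never touches any audio; the bar heights are just random numbers. The names and the text rendering are made up for illustration.

```python
import random

def fake_bars(count=5, max_height=8, rng=random):
    """Pick a random height for each bar; no audio is consulted at all."""
    return [rng.randint(1, max_height) for _ in range(count)]

def render(heights, max_height=8):
    """Draw the bars as rows of text, tallest row first."""
    rows = []
    for level in range(max_height, 0, -1):
        rows.append(" ".join("#" if h >= level else " " for h in heights))
    return "\n".join(rows)

# Three "animation frames" of the fake display.
for _ in range(3):
    print(render(fake_bars()))
    print("-" * 9)
```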

On preview: MeFi's spell checker has a silly spectrum analyzer that clearly does absolutely nothing related to my horrible spelling. Is there a reason?

posted by Dataphage at 10:44 PM on August 14, 2006


Look, no one said which spectrum it had to analyze.
posted by cortex at 10:47 PM on August 14, 2006


You don't understand. An MP3 file consists of nothing except spectrum coefficients. That's what the Fourier function used in the encoding process creates and what is stored in the MP3.

I have never looked at that page and I have no idea whether the spectrum analyzer being displayed there does anything meaningful, but if it is honest, it isn't expensive at all. It's virtually free. (In cycles, though not necessarily in terms of lines of code.)
posted by Steven C. Den Beste at 10:50 PM on August 14, 2006


I wish this had been on ask.mefi, cause Steven C. Den Beste just rocked out and it would have been nice to be able to flag it as such.
posted by shoepal at 10:59 PM on August 14, 2006


What's stopping you, shoepal?
posted by Gator at 11:01 PM on August 14, 2006


SCDB: If you are replying to me, yes, I do not understand an MP3. I am reading more right now, thanks for reminding me how fun it is to play with waves and Fourier. :)

What I do understand is that Actionscript, the language used by Flash, does not give you access to an MP3's innards. You can basically move the playhead and look at the ID3 tags.

If the player was programmed in any language that lets you actually look at the bytes in the file, then making an honest 'spectrum analyzer' would not be expensive.

What is expensive is building a custom .swf file on the server from and for every audio file that is being served. Especially for someone like Myspace, that must serve several million a day.

I think Mefites, being a lot more cultivated and intelligent than the average Myspace user, and harder to impress with shiny animations, could do without the fake animation, as long as there is an indication that a sound file is being played.
posted by Dataphage at 11:12 PM on August 14, 2006


OK, I'll bore you a bit more:

One of the black magicians tried to explain to me how the variable rate codecs work, and though I'm pretty sure I didn't fully understand it, I'll tell you what I got out of it:

There are 50 frames per second, 20 milliseconds per frame. The algorithm takes advantage of the fact that over a period of time that short, human speech is actually quite slow moving. Most frames consist of the same spectrum end-to-end, or close to it.

The encoder starts by doing an FFT on the waveform for that 20 millisecond period. Amazingly, it then does a table lookup. There's a predefined table, in firmware, of about a thousand entries which were, I think, created by statistically processing many hours of voice captured from real speakers of various languages. It finds the entry which is closest to the sound it needs to encode for that frame, and what it sends is the table index for that entry. (That's only possible because the codec is optimized for voice. It also means that these kinds of codecs don't do very well with sounds that don't sound like human voices, e.g. music.)

But the entry may be close or not very close to the sound which is actually needed. If it's quite close, the encoder chooses an 8th rate packet since all it needs to send is the table index. If it's not very close, it creates a quarter rate, or half rate, or full rate (i.e. progressively larger) packet and what the rest of the packet consists of is the changes that need to be made to the value of that table entry in order for it to correctly reproduce the sound to be sent.

The packet says "Start with this entry, and then change the following in it". The packet size is a function of how much change needs to be made to get the sound close to what it needs to be.
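The "table entry plus changes" scheme can be sketched in Python as a nearest-match lookup followed by a packet-size decision. Everything concrete here is invented for illustration: the three-entry codebook, the error thresholds, and the number of corrections each packet size carries. It only mirrors the description above and is not the actual EVRC algorithm.

```python
# Hypothetical codebook of 4-band spectra; the real table is said to hold
# about a thousand entries.
CODEBOOK = [
    [1.0, 0.5, 0.1, 0.0],
    [0.1, 0.9, 0.6, 0.2],
    [0.0, 0.2, 0.8, 0.7],
]

# Invented limits: how many corrections each packet size has room for.
CORRECTIONS_PER_RATE = {"eighth": 0, "quarter": 1, "half": 2, "full": 4}

def nearest_entry(codebook, spectrum):
    """Return (index, squared_error) of the closest codebook entry."""
    best_index, best_error = 0, float("inf")
    for index, entry in enumerate(codebook):
        error = sum((a - b) ** 2 for a, b in zip(entry, spectrum))
        if error < best_error:
            best_index, best_error = index, error
    return best_index, best_error

def choose_rate(error, thresholds=(0.01, 0.1, 0.5)):
    """Bigger mismatch means a bigger packet. Thresholds are made up."""
    if error <= thresholds[0]:
        return "eighth"
    if error <= thresholds[1]:
        return "quarter"
    if error <= thresholds[2]:
        return "half"
    return "full"

def encode_frame(codebook, spectrum):
    """Build a packet: table index plus the largest corrections that fit."""
    index, error = nearest_entry(codebook, spectrum)
    rate = choose_rate(error)
    deltas = [(i, round(s - e, 3))
              for i, (s, e) in enumerate(zip(spectrum, codebook[index]))]
    deltas.sort(key=lambda item: abs(item[1]), reverse=True)
    return {"rate": rate, "index": index,
            "corrections": deltas[:CORRECTIONS_PER_RATE[rate]]}

print(encode_frame(CODEBOOK, [1.0, 0.5, 0.1, 0.0]))  # exact match
print(encode_frame(CODEBOOK, [0.1, 0.9, 0.9, 0.2]))  # close, one band off
```

An exact match needs only the index, so it goes out as an eighth-rate packet. The second frame is close to entry 1 but one band is off, so it is sent as a quarter-rate packet carrying a single correction.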

One of our standard tools permitted us to see the packet sizes in real time during a call. It was interesting to watch it. When the other guy is speaking, and you're listening, you're sending 8th rate packets the majority of the time. When you're speaking (or at least when I was speaking) the majority of the packets were quarter or half rate. Full rate packets were always a minority.
posted by Steven C. Den Beste at 11:12 PM on August 14, 2006 [4 favorites]


You are describing CELP, which has been around, in theory, since before I was born. It [...] starts with the assumption that the speech signal is produced by a buzzer at the end of a tube.
posted by Dataphage at 11:28 PM on August 14, 2006


Nothing like a good FFT at 2am... I knew about FFTs and the like, from all my DSP coursework, which makes the dummy spectrum even that much more annoying, mocking us with false hopes of meaningful coefficients.
Thanks Dataphage, that is a good reason for it. However, it really takes up a not insignificant amount of graphical real estate. No need to cater to me over it, it just drives me kinda nuts when I'm using some lo-res systems to listen to the font of auditory joy that is Music. You'd be surprised how much of a 640x480 screen gets eaten by those randomly bumping bars.
posted by potch at 11:31 PM on August 14, 2006


So since we're on the topic, SCDB, answer me this.

If MP3s are "nothing but Fourier coefficients" and don't describe actual waveforms - why does the sampling rate (22khz, 44.1khz etc.) matter in MP3s? I've often wondered this.
posted by Jimbob at 11:48 PM on August 14, 2006


(which is to say - if each frame just has an equation describing the shape of the wave it needs to recreate, why do MP3s care about the sampling rate of the original wave? It's just drawing a curve, right? We don't need to know if that curve was originally described by 8000 points per second, or 44,000 points per second.)
posted by Jimbob at 11:50 PM on August 14, 2006


The black magician I was talking to told me that when they worked on a previous codec (which was eventually replaced by EVRC) they went out and found people in the company who had strange sounding voices, and who spoke other languages, and invited them into the lab to record some voice for the codec to work on.

There was one Chinese woman they got, and when she was speaking Chinese occasionally the codec would break up entirely, producing all kinds of garbage. They analyzed her voice and found out that she had been producing pure tones, sine waves. The prototype version of the codec said, "No! Human voices don't do that!" and couldn't deal with it.

Dataphage: CELP is a related algorithm, but not identical to EVRC. For one thing, CELP isn't variable rate, and the IS-95/CDMA2K was designed to rely on variable rate codecs in order to increase system capacity.

Jimbob, the 22KHz/44KHz sampling rate difference has to do with the Nyquist frequency. More or less, it isn't possible for a digital sampling system to reproduce any frequency greater than or equal to half its sampling rate. It can only reliably reproduce frequencies less than half its sampling rate.

A 22 KHz sample rate means that the reproduced sound will cut off at 11 KHz. A 44 KHz sample rate permits reproduction up to 22 KHz. No human can hear 22 KHz, but most of us can hear higher than 11 KHz. (Generally kids can hear to the high teens; adults tend to cut off around 12-14 KHz.)

In terms of the FFT, if the sample rate is 22 KHz then the FFT cannot produce coefficients for any frequencies higher than 11KHz; the data isn't there to do it.
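The Nyquist limit can be seen directly in a short Python sketch. With a 22 kHz sample rate, a 15 kHz tone (above the 11 kHz limit) produces exactly the same samples as a 7 kHz tone, so once sampled there is no way to tell them apart. The specific frequencies and the helper name are chosen only for illustration.

```python
import math

def sample_tone(freq_hz, sample_rate_hz, count):
    """Sample a cosine of the given frequency at the given rate."""
    return [math.cos(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(count)]

SAMPLE_RATE = 22000
nyquist = SAMPLE_RATE / 2  # highest frequency the samples can represent

above = sample_tone(15000, SAMPLE_RATE, 32)                # above the limit
alias = sample_tone(SAMPLE_RATE - 15000, SAMPLE_RATE, 32)  # 7 kHz, below it

# Largest difference between the two sample sequences.
worst = max(abs(a - b) for a, b in zip(above, alias))
print(f"Nyquist limit: {nyquist} Hz, largest sample difference: {worst:.2e}")
```

The difference is zero apart from floating point rounding, which is why the data to produce coefficients above half the sample rate simply isn't there.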

The 44 KHz rate specified in the original CD design was a case of slightly overspeccing, in order to make sure that CDs could reproduce high frequencies at least as well as vinyl. 44 KHz gave ample margin to reproduce sound up to 20 KHz, the upper bound of the traditional frequency range of home stereo (20-20KHz), which itself was chosen because no human can hear either end frequency.
posted by Steven C. Den Beste at 12:11 AM on August 15, 2006 [1 favorite]


Heh. I always wondered if you could compress sound through this exact method, without bothering to look into it at all. Turns out everyone else in the universe had the same idea.
posted by Ryvar at 1:37 AM on August 15, 2006


flagged as noise.
posted by davehat at 1:53 AM on August 15, 2006


damn you evariste, now he's never going to stop talking about codecs

Steven, now you're just fapping to the specter of people joining in your derail. Did you read Dataphage's original comments at all? The ones that plainly state that Matt's part of the flash is not touching the mp3 file at all? (it's just a wrapper with a few custom graphics around the built-in routine that streams the file and allows you to scrub through the track).

It's just a throbber in random-wave-form [lol] – it has absolutely nothing to do with the mp3 file or how it's encoded at all. It wouldn't behave any differently playing 4'33" than any other track.

A non-server-side possibility for a relevant visualizer – the swf could stream in the file first using http, then spoonfeed it to the stock mp3 streamer routine built into flash, and then manipulate the FFTs in advance of the part currently being played. But that would be retarded, annoying to program, and easily quadruple the size of the flash file (it's currently pretty damn minimal, making it feasible to have 40 of them on one page). And it would likely increase the computational load and memory consumption – ECMA languages don't handle this kind of stuff gracefully.
posted by blasdelf at 2:38 AM on August 15, 2006


If MP3s are "nothing but Fourier coefficients" and don't describe actual waveforms - why does the sampling rate (22khz, 44.1khz etc.) matter in MP3s? I've often wondered this.

The coefficients are relative. You need to know the sample rate or you will have no idea what a coefficient means. It isn't entirely unlike units - when I say "I'm going 100", 100 what (mph, kph, etc.)? If I have a coefficient from an mp3 file, I still need to know the sample rate in order to make it sound right.
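The "units" point can be made concrete with the standard DFT relationship between a coefficient's position and the frequency it stands for. This Python sketch assumes a plain DFT over a 1024-sample frame, which is an arbitrary size chosen for illustration and not the MP3 frame layout.

```python
def bin_frequency(k, frame_size, sample_rate_hz):
    """Frequency in Hz represented by DFT coefficient k."""
    return k * sample_rate_hz / frame_size

# The same coefficient index means a different pitch at each sample rate.
for rate in (22050, 44100):
    print(f"coefficient 100 at {rate} Hz sampling = "
          f"{bin_frequency(100, 1024, rate)} Hz")
```

Coefficient 100 is the same number in both cases, but it names a tone an octave higher at 44.1 kHz than at 22.05 kHz. Play the file back with the wrong rate and every pitch shifts accordingly.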

This is made a little harder to understand because volume is relative. You don't need units for volume, you are going to set it on the output equipment. This is an artifact of how the ear works, you could have the frequency adjustable at the output end too, but that wouldn't sound right.

There is another non-difference that can add confusion. 22kHz implies a resolution as well as a scale fix (or units, or whatever you want to call that). Resolution isn't critical to reproduction, but it does tell you something about the amount of information in a recording. In an all digital recording you are sampling in two different axes, time and amplitude, so you need to know the resolution in both. Hence, you see 8-bit, 16-bit, or 24-bit and 22kHz, 48kHz, or 192kHz attached to a recording.
posted by Chuckles at 5:52 AM on August 15, 2006


Steven - actually kids can hear pretty dang close to 20K, and sometimes above. In my early twenties, I could still hear 19K.

At 40, I cut off at 16K, but oddly enough there are holes in my hearing at 60Hz harmonics, which is what I get for spending a lot of time around computer fans, which oscillate in 60Hz multiples.
posted by plinth at 5:59 AM on August 15, 2006


You'd be surprised how much of a 640x480 screen gets eaten by those randomly bumping bars.

640x480?

There's probably a perfectly good reason, I figure, but yeesh.
posted by cortex at 6:17 AM on August 15, 2006




It is my understanding that the highest pure tone you can hear is not the same as the lowest frequency that can be excluded from a recording without any impact on the perceived sound.
posted by Chuckles at 6:54 AM on August 15, 2006


Plinth, I'm jealous. I'm unable to hear anything above 12K at ~30. Damn you rock concerts, constant cooling fan noises, and a childhood fascination with monster truck rallies. Damn you straight to hearing hell.
posted by togdon at 8:33 AM on August 15, 2006


Human speech is tremendously compressible because, if one looks at a spectrogram (with intensity values at each pair of (time,frequency)) of recorded speech, most of the acoustic energy is concentrated in only a few bands, called formants, of which there are only finitely many basic types. Moreover, they do not all vary independently all the time, and certainly don't vary all that quickly -- which is why the table entry + change information approach works as well as it does.
posted by little miss manners at 9:01 AM on August 15, 2006


Blasdelf, excuse me for breathing.
posted by Steven C. Den Beste at 10:26 AM on August 15, 2006


They analyzed her voice and found out that she had been producing pure tones, sine waves. The prototype version of the codec said, "No! Human voices don't do that!" and couldn't deal with it.

As a linguist I'd be awfully impressed if you could produce a recording of a sine wave generated by a human (even while not trying to speak) -- something moderately close maybe, but the physical properties of the vocal tract make it hard for me to see how this could ever happen, in any language.
posted by advil at 10:56 AM on August 15, 2006


even while not trying to speak

By which I mean even while just trying to produce a sine wave and nothing else, as opposed to say something in some language.
posted by advil at 10:58 AM on August 15, 2006


They thought the same thing, which is why their codec didn't handle it well and why they were so amazed.
posted by Steven C. Den Beste at 11:07 AM on August 15, 2006


Steven, would it kill you to think for a second before trying to sound like mister science? I've seen a couple of times where you "drop science" in a completely inappropriate context. For example, when someone asked on askme about dissolving wool, you explained that it was made of keratin and therefore dissolving it without dissolving plastic would be impossible, then you were immediately contradicted by other people's real-world experience. In another question about Relativity, you spouted off on a bunch of stuff that did absolutely nothing to answer the poster's question.

In regards to your comment about how MP3s are compressed, that's entirely beside the point. If you'd taken a bit of time to read and understand what was going on, you'd see that the poster wasn't complaining about how the spectrum analyzer works; he was complaining that the spectrum analyzer did nothing at all. It's just a looping animation.

And while I don't know what flash is capable of, even the earliest versions of winamp had scope-mode visualizers. An MP3 player must decode the MP3 into a waveform, and therefore it must have access to the wave form to display on the screen.

Bleh.
posted by delmoi at 11:04 PM on August 15, 2006 [1 favorite]


Test for the highest frequency you can hear

/goes to site:

Forbidden
You don't have permission to access /ochenk/21000.wav on this server.


It's like that for all of them :P
posted by delmoi at 11:11 PM on August 15, 2006


Sorry delmoi, it worked a few weeks ago when I found it.
Try this from here and here.
posted by MetaMonkey at 3:48 AM on August 16, 2006


There's a predefined table, in firmware, of about a thousand entries which were, I think, created by statistically processing many hours of voice captured from real speakers of various languages.

It's bad enough to have human hair mixed into my soy sauce, now I've got real speakers of various languages sprinkled into my phone calls?
posted by StickyCarpet at 9:20 AM on August 16, 2006 [1 favorite]

