A new unsupervised method that combines the two concepts of vector quantization and space-filling curves to interpret the latent space of DNNs
This post is a short explanation of our novel unsupervised distribution modeling technique called space-filling vector quantization [1], published at the Interspeech 2023 conference. For more details, please take a look at the paper under this link.
Deep generative models are well-known neural network-based architectures that learn a latent space whose samples can be mapped to sensible real-world data such as images, video, and speech. Such latent spaces act as a black box, and they are typically difficult to interpret. In this post, we introduce our novel unsupervised distribution modeling technique that combines the two concepts of space-filling curves and vector quantization (VQ), called Space-Filling Vector Quantization (SFVQ). SFVQ helps to make the latent space interpretable by capturing its underlying morphological structure. It is important to note that SFVQ is a generic tool for modeling distributions, and its use is not restricted to any specific neural network architecture nor any data type (e.g., image, video, speech, etc.). In this post, we demonstrate the application of SFVQ to interpret the latent space of a voice conversion model. To understand this post you do not need any technical knowledge of speech signals, because we explain everything in general (non-technical) terms. To begin with, let me explain what the SFVQ technique is and how it works.
Space-Filling Vector Quantization (SFVQ)
Vector quantization (VQ) is a data compression technique similar to the k-means algorithm, which can model any data distribution. The figure below shows VQ applied to a Gaussian distribution. VQ clusters this distribution (gray points) using 32 codebook vectors (blue points) or clusters. Each Voronoi cell (green lines) contains one codebook vector, such that this codebook vector is the closest codebook vector (in terms of Euclidean distance) to all data points located in that Voronoi cell. In other words, each codebook vector is the representative vector of all data points located in its corresponding Voronoi cell. Therefore, applying VQ to this Gaussian distribution means mapping each data point to its closest codebook vector, i.e. representing each data point by its closest codebook vector. For more information about VQ and its other variants you can check out this post.
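As a toy illustration (not code from the paper), the nearest-codebook assignment at the heart of VQ can be sketched in a few lines of NumPy; the data size, codebook size, and random initialization below are arbitrary choices for demonstration:

```python
import numpy as np

# Hypothetical toy example: quantize 2-D Gaussian samples with 32 codebook vectors,
# analogous to a single k-means assignment step.
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 2))      # gray points in the figure
codebook = rng.normal(size=(32, 2))    # blue codebook vectors

# Map every data point to its nearest codebook vector (Euclidean distance).
distances = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=-1)
indices = distances.argmin(axis=1)     # which Voronoi cell each point falls in
quantized = codebook[indices]          # each point represented by its codebook vector

print(quantized.shape)                 # (1000, 2)
```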
A space-filling curve is a piecewise continuous line generated by a recursive rule; if the recursion iterations are repeated infinitely, the curve bends until it completely fills a multi-dimensional space. The following figure illustrates the Hilbert curve [2], which is a well-known type of space-filling curve in which the corner points are defined by a specific mathematical formulation at each recursion iteration.
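For readers curious how such corner points can be generated, below is a minimal sketch of the standard Hilbert curve index-to-coordinate construction (the common bit-twiddling formulation, not code from our paper):

```python
def hilbert_points(order):
    """Corner points of the Hilbert curve at a given recursion order,
    on a 2**order x 2**order grid (standard index-to-coordinate mapping)."""
    n = 2 ** order
    points = []
    for d in range(n * n):          # visit corners in curve order
        x = y = 0
        t = d
        s = 1
        while s < n:
            rx = 1 & (t // 2)
            ry = 1 & (t ^ rx)
            if ry == 0:             # rotate the quadrant if needed
                if rx == 1:
                    x, y = s - 1 - x, s - 1 - y
                x, y = y, x
            x += s * rx
            y += s * ry
            t //= 4
            s *= 2
        points.append((x, y))
    return points

print(hilbert_points(1))  # [(0, 0), (0, 1), (1, 1), (1, 0)]
```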
Taking intuition from space-filling curves, we can think of vector quantization (VQ) as mapping input data points onto a space-filling curve (rather than mapping data points exclusively onto codebook vectors, as we do in normal VQ). Therefore, we incorporate vector quantization into space-filling curves, such that our proposed space-filling vector quantizer (SFVQ) models a D-dimensional data distribution by a continuous piecewise-linear curve whose corner points are vector quantization codebook vectors. The following figure illustrates VQ and SFVQ applied to a Gaussian distribution.
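To make the idea concrete, here is a minimal, hypothetical sketch of how a data point could be mapped onto such a piecewise-linear curve by projecting it onto the segments between consecutive codebook vectors; this only illustrates the concept and is not the training or mapping procedure from the paper [1]:

```python
import numpy as np

def project_to_segment(p, a, b):
    """Closest point to p on the line segment from a to b."""
    ab = b - a
    t = np.clip(np.dot(p - a, ab) / (np.dot(ab, ab) + 1e-12), 0.0, 1.0)
    return a + t * ab

def sfvq_map(point, codebook):
    """Map a point onto the piecewise-linear curve whose corner points are the
    (ordered) codebook vectors; return the closest curve point and segment index."""
    candidates = [project_to_segment(point, codebook[i], codebook[i + 1])
                  for i in range(len(codebook) - 1)]
    dists = [np.linalg.norm(point - c) for c in candidates]
    best = int(np.argmin(dists))
    return candidates[best], best

# Toy usage with arbitrary values
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
print(sfvq_map(np.array([0.6, 0.2]), codebook))
```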
For technical details on how to train SFVQ and how to map data points onto SFVQ's curve, please see Section 2 in our paper [1].
Note that when we train a normal VQ on a distribution, adjacent codebook vectors inside the learned codebook matrix can refer to completely different contents. For example, the first codebook element could refer to a vowel phone and the second one to a silent part of the speech signal. However, when we train SFVQ on a distribution, the learned codebook vectors are arranged such that adjacent elements in the codebook matrix (i.e. adjacent codebook indices) refer to similar contents in the distribution. We can use this property of SFVQ to interpret and explore the latent spaces in deep neural networks (DNNs). As a typical example, in the following we explain how we used our SFVQ method to interpret the latent space of a voice conversion model [3].
Voice Conversion
The following figure shows a voice conversion model [3] based on the vector quantized variational autoencoder (VQ-VAE) [4] architecture. In this model, the encoder takes the speech signal of speaker A as input and passes its output to the vector quantization (VQ) block, which extracts the phonetic information (phones) from the speech signal. Then, this phonetic information, together with the identity of speaker B, goes into the decoder, which outputs the converted speech signal. The converted speech contains the phonetic information (content) of speaker A with the identity of speaker B.
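To make the data flow concrete, here is a heavily simplified, hypothetical PyTorch sketch of this encoder → VQ → decoder pipeline; the module names, layer choices, and dimensions are placeholders and do not correspond to the actual architecture in [3]:

```python
import torch
import torch.nn as nn

class VoiceConversionVQVAE(nn.Module):
    """Simplified sketch of the VQ-VAE voice-conversion data flow (placeholder layers)."""
    def __init__(self, n_mels=80, latent_dim=64, n_codes=256, n_speakers=100):
        super().__init__()
        self.encoder = nn.Conv1d(n_mels, latent_dim, kernel_size=3, padding=1)
        self.codebook = nn.Embedding(n_codes, latent_dim)        # VQ block (phonetic content)
        self.speaker_emb = nn.Embedding(n_speakers, latent_dim)  # identity of speaker B
        self.decoder = nn.Conv1d(2 * latent_dim, n_mels, kernel_size=3, padding=1)

    def forward(self, mel_A, speaker_B):
        z = self.encoder(mel_A)                                   # (B, D, T) latents of speaker A
        z_t = z.transpose(1, 2)                                   # (B, T, D)
        # Nearest-codebook quantization: keep only the phone-related content.
        dists = ((z_t.unsqueeze(2) - self.codebook.weight) ** 2).sum(-1)  # (B, T, n_codes)
        z_q = self.codebook(dists.argmin(dim=-1)).transpose(1, 2)         # (B, D, T)
        # Concatenate the target speaker identity and decode the converted speech.
        spk = self.speaker_emb(speaker_B)[:, :, None].expand(-1, -1, z_q.shape[-1])
        return self.decoder(torch.cat([z_q, spk], dim=1))         # (B, n_mels, T)
```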
In this model, the VQ module acts as an information bottleneck that learns a discrete representation of speech that captures only the phonetic content and discards the speaker-related information. In other words, the VQ codebook vectors are expected to capture only the phone-related content of the speech. Here, the representation at the VQ output is considered the latent space of this model. Our goal is to replace the VQ module with our proposed SFVQ method to interpret the latent space. By interpretation we mean identifying which phone each latent vector (codebook vector) corresponds to.
Interpreting the Latent Space using SFVQ
We evaluate the performance of our space-filling vector quantizer (SFVQ) on its ability to find the structure in the latent space (representing phonetic information) of the above voice conversion model. For our evaluations, we used the TIMIT dataset [5], since it contains phone-wise labeled data using the phone set from [6]. For our experiments, we use the following phonetic grouping (also written out as a small code snippet after the list):
- Plosives (Stops): {p, b, t, d, k, g, jh, ch}
- Fricatives: {f, v, th, dh, s, z, sh, zh, hh, hv}
- Nasals: {m, em, n, nx, ng, eng, en}
- Vowels: {iy, ih, ix, eh, ae, aa, ao, ah, ax, ax-h, uh, uw, ux}
- Semi-vowels (Approximants): {l, el, r, er, axr, w, y}
- Diphthongs: {ey, aw, ay, oy, ow}
- Silence: {h#}.
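For reference, the grouping above can be written as a plain Python dictionary (the group names are our own shorthand, the phone labels come from the TIMIT phone set):

```python
# Phonetic grouping used in the experiments, as TIMIT phone labels.
PHONE_GROUPS = {
    "plosives":   ["p", "b", "t", "d", "k", "g", "jh", "ch"],
    "fricatives": ["f", "v", "th", "dh", "s", "z", "sh", "zh", "hh", "hv"],
    "nasals":     ["m", "em", "n", "nx", "ng", "eng", "en"],
    "vowels":     ["iy", "ih", "ix", "eh", "ae", "aa", "ao", "ah", "ax", "ax-h", "uh", "uw", "ux"],
    "semivowels": ["l", "el", "r", "er", "axr", "w", "y"],
    "diphthongs": ["ey", "aw", "ay", "oy", "ow"],
    "silence":    ["h#"],
}
```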
To analyze the performance of our proposed SFVQ, we pass the labeled TIMIT speech files through the trained encoder and SFVQ modules, respectively, and extract the codebook vector indices corresponding to all phones present in the speech. In other words, we pass a speech signal with labeled phones and then compute the indices of the learned SFVQ codebook vectors that these phones are mapped to. As explained above, we expect our SFVQ to map similar phonetic contents next to each other (index-wise in the learned codebook matrix). To examine this expectation, in the following figure we visualize the spectrogram of the sentence "she had your dark suit", and its corresponding codebook vector indices for the ordinary vector quantizer (VQ) and our proposed SFVQ.
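As a rough sketch of this analysis step (with hypothetical function and variable names, and simple nearest-codebook indexing standing in for the actual curve mapping described in [1]):

```python
import numpy as np

def phone_to_index_map(frames, phone_labels, encoder, sfvq_codebook):
    """Hypothetical sketch: encode each labeled frame and record which codebook
    index it is mapped to, grouped by phone label.

    `frames` and `phone_labels` are aligned sequences, `encoder` is the trained
    encoder (a callable), and `sfvq_codebook` is the learned (K, D) codebook.
    The real SFVQ mapping projects onto the piecewise-linear curve; here we use
    plain nearest-codebook indexing for brevity.
    """
    mapping = {}
    for frame, phone in zip(frames, phone_labels):
        z = encoder(frame)                                          # latent vector of this frame
        idx = int(np.argmin(np.linalg.norm(sfvq_codebook - z, axis=1)))
        mapping.setdefault(phone, []).append(idx)
    return mapping
```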