I just got a project running whereby I used Python + pdfplumber to read in 1100 PDF files, most of my Humble Bundle collection. I extracted the text and dumped it into a 'documents' table in PostgreSQL. Then I used sentence-transformers to reduce each 1K chunk to a single 384-dimensional vector, which I wrote back to the db. Then I averaged these to produce a document-level embedding as a single vector.
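For the curious, a minimal sketch of that pipeline, assuming "1K chunk" means 1,000 characters and the all-MiniLM-L6-v2 model (which is what produces 384-dimensional vectors); the DB write is omitted:

    # Sketch of the extract -> chunk -> embed -> average steps described above.
    # Model choice and chunking unit are assumptions, not the author's exact setup.
    import pdfplumber
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim output

    def extract_text(path):
        """Concatenate the text of every page of a PDF."""
        with pdfplumber.open(path) as pdf:
            return "\n".join(page.extract_text() or "" for page in pdf.pages)

    def embed_document(path, chunk_size=1000):
        """Embed fixed-size character chunks, then mean-pool into one vector."""
        text = extract_text(path)
        chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
        chunk_vecs = model.encode(chunks)            # (n_chunks, 384) numpy array
        return chunk_vecs, chunk_vecs.mean(axis=0)   # per-chunk + document vector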
Then I was able to apply UMAP + HDBSCAN to this dataset, which produced a 2D plot of all my books. Later I put the discovered topic labels back in the db and used them to compute TF-IDF for my clusters, from which I could pick the top 5 terms to serve as a crude cluster label.
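The clustering and crude-labeling steps might look roughly like this (min_cluster_size and the per-cluster pseudo-document trick are my guesses, not necessarily the author's recipe; synthetic blobs stand in for the real books):

    # Cluster document embeddings, then label each cluster with its top-5
    # TF-IDF terms. Synthetic vectors/texts stand in for the 1100 real books.
    import numpy as np
    import umap      # umap-learn
    import hdbscan
    from sklearn.feature_extraction.text import TfidfVectorizer

    rng = np.random.default_rng(0)
    doc_vecs = np.vstack([rng.normal(loc=5.0 * i, size=(100, 384))
                          for i in range(3)])                 # 3 fake topic blobs
    doc_texts = (["baking bread cakes"] * 100
                 + ["python code recursion"] * 100
                 + ["salad dressing greens"] * 100)

    coords = umap.UMAP(n_components=2).fit_transform(doc_vecs)
    labels = hdbscan.HDBSCAN(min_cluster_size=5).fit_predict(coords)

    # Join each cluster's documents into one pseudo-document and take its
    # top-5 TF-IDF terms as a crude label (-1 is HDBSCAN's noise bucket).
    clusters = sorted(set(labels) - {-1})
    pseudo_docs = [" ".join(t for t, l in zip(doc_texts, labels) if l == c)
                   for c in clusters]
    vec = TfidfVectorizer(stop_words="english")
    tfidf = vec.fit_transform(pseudo_docs).toarray()
    terms = np.array(vec.get_feature_names_out())
    for c, row in zip(clusters, tfidf):
        print(c, ", ".join(terms[np.argsort(row)[::-1][:5]]))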
It took about 20 to 30 hours to finish all these steps, and I was very impressed with the results. I could see my cookbooks clearly separated from my programming and math books. I could drill in and see subclusters for baking, BBQ, salads, etc.
Currently I'm putting it into a two-container Docker Compose file: a base PostgreSQL image plus a Python container I'm working on.
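A compose file along those lines might look like this (image tag, credentials, and service names are illustrative placeholders, not the actual setup):

    # Two-container sketch: stock PostgreSQL plus the custom Python image.
    # All names and credentials below are placeholders.
    services:
      db:
        image: postgres:16
        environment:
          POSTGRES_DB: books
          POSTGRES_PASSWORD: example
        volumes:
          - pgdata:/var/lib/postgresql/data
      app:
        build: .                    # the Python container under development
        depends_on:
          - db
        environment:
          DATABASE_URL: postgresql://postgres:example@db:5432/books
    volumes:
      pgdata: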
Going from monotone delivery to that of an expressive audiobook narrator increased my "American English" score from 52% to 92%.
I'd suggest training a little less on audio books.
The source code for this is unminified and very readable if you’re one of the rare few who has interesting latent spaces to visualize.
https://accent-explorer.boldvoice.com/script.js?v=5
Nothing too secret in there! We anonymized everything and anyway it's just a basic Plotly plot. Feel free to check it out.
Good catch. I really hate JavaScript, so I never got into D3.js; Plotly was such a lifesaver.
Plotly is great! Much love.
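For reference, a "basic Plotly plot" of 3D points colored by accent can indeed be this short (the coordinates and column names below are made up for illustration):

    # Minimal 3D scatter in the spirit of the explorer's plot; all data
    # here is fabricated for illustration.
    import pandas as pd
    import plotly.express as px

    df = pd.DataFrame({
        "x": [0.1, 0.4, 0.8], "y": [1.2, 0.9, 0.3], "z": [0.5, 0.7, 0.2],
        "accent": ["Russian", "Persian", "Spanish"],
    })
    px.scatter_3d(df, x="x", y="y", z="z", color="accent",
                  hover_name="accent").show()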
Apparently Persian and Russian are close, which is surprising to say the least. I know people keep getting confused about how Portuguese from Portugal and Russian sound close, yet the Persian one is new to me.
Idea: Farsi and Russian both have a simple inventory of vowel sounds and no diphthongs, making it hard (and obvious) when attempting to speak English, which is rife with diphthongs and many different vowel sounds.
Yeah, they seem to be in the same "major" cluster, although Serbian/Croatian, Romanian, Bulgarian, Turkish, Polish, and Czech are all close.
Turkish and Persian seem to be the nearest neighbors.
When I went to Portugal I was struck by how much Portuguese there does sound like Spanish with a Russian accent!
Part of this is the "dark L" sound
I’d guess that the sibilants, consonant clusters, and/or vowel reduction would play a big role.
I thought I was the only one who perceived an audible similarity between Portuguese and Russian.
Irish accent appears to break it.
We are working on this - we don't have quite enough Irish speech data.
Fascinating! How did you decouple the speaker-specific vocal characteristics (timbre, pitch range) from the accent-defining phonetic and prosodic features in the latent space?
We didn't, explicitly. Because we fine-tuned this model for accent classification, the later transformer layers appear to ignore non-accent vocal characteristics. I verified this for gender, for example.
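One way such a check might be done (a hedged sketch; the post doesn't name the actual model or method, so facebook/wav2vec2-base stands in, matching the 12 layers of 768 dims mentioned elsewhere in the thread): fit a simple probe on each layer's pooled embeddings and see whether gender is still linearly recoverable at the later layers.

    # Hypothetical layer-wise gender probe; the backbone is a stand-in
    # for the real (unnamed) accent model.
    import torch
    from transformers import AutoFeatureExtractor, AutoModel
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
    backbone = AutoModel.from_pretrained("facebook/wav2vec2-base")

    def layer_embeddings(waveform, sr=16000):
        """One mean-pooled 768-dim vector per transformer layer for a clip."""
        inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
        with torch.no_grad():
            out = backbone(**inputs, output_hidden_states=True)
        # hidden_states = feature projection + 12 encoder layers; skip the first
        return [h.mean(dim=1).squeeze(0).numpy() for h in out.hidden_states[1:]]

    def probe_accuracy(layer_vecs, genders):
        """Cross-validated accuracy of a linear gender probe on one layer."""
        clf = LogisticRegression(max_iter=1000)
        return cross_val_score(clf, layer_vecs, genders, cv=5).mean()

    # Stack layer_embeddings(clip)[k] across many labeled clips, then call
    # probe_accuracy per layer k: near-chance accuracy at the later layers
    # is evidence the model has discarded gender information there.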
It would've been nice to be able to visualize the differences between the various accents within Spanish. Really cool tho!
Yeah, we would've loved to see that too. It's on our roadmap for sure. Same for some of the other languages with a large number of distinct accents, e.g. French, Chinese, Arabic, etc.
Thank you for sharing! The 3D visual was an interesting application of the UMAP technique.
Is there a way to subscribe to these blog posts for auto-notification?
Yeah, if only there was a protocol for that.
It would have taken you a second more to type out "RSS", and turn a sarcastic comment into an informative one.
Obligatory xkcd: https://xkcd.com/1053/
What's the dimensionality of the latent space? How were the 3 visualized dimensions selected?
12 layers of 768-dim each. The 3 dimensions visualized are chosen by UMAP.
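So the reduction step is presumably something along these lines (the mean-pooling over time and the stacking of all 12 layers are guesses at the exact recipe):

    # Flatten per-clip activations (12 layers x 768 dims, mean-pooled over
    # time) and let UMAP choose the 3 plot axes. Random numbers stand in
    # for the real model's features.
    import numpy as np
    import umap  # umap-learn

    rng = np.random.default_rng(0)
    features = rng.normal(size=(200, 12 * 768)).astype(np.float32)
    coords = umap.UMAP(n_components=3).fit_transform(features)  # (200, 3)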
Why is Spanish so distributed?
Good question! It's likely because there are lots of different accents of Spanish that are distinct from each other. Our labels only capture the native language of the speaker right now, so they're all grouped together, but it's definitely on our to-do list to go deeper into the sub-accents of each language family!
Not sure, could be the large number of Spanish dialects represented in the dataset, label noise, or something else. There may just be too much diversity in the class to fit neatly in a cluster.
Also, the training dataset is highly imbalanced and Spanish is the most common class, so the model predicts it as a sort of default when it isn't confident -- this could lead to artifacts in the reduced 3d space.
This is a fascinating look at how AI interprets accents! It reminds me of some recent advancements in speech recognition tech, like Google's Dialect Recognition feature, which also attempts to adapt to different accents. I wonder how these models could be improved further to not just recognize but also appreciate the nuances of regional accents.
Really fun discovery clicking a dot and hearing the accent. Neat visualization, lots to think about!