ToneClone
Provide future guitarists with the tools to effortlessly replicate tones through machine learning and remove barriers to creativity.
Problem and Motivation
Many beginning guitarists struggle to achieve the sounds they hear in professional recordings. Learning to play is already hard, but understanding how effects shape tone, and how to use them effectively, is even harder. Effects such as distortion, delay, reverb, and modulation are the building blocks of professional sounds, but they can be confusing to new players.
The goal of our product, ToneClone, is to address this challenge by analyzing guitar audio to identify the effects used and providing accessible, tailored guidance that educates guitarists about those effects. This approach lets users bridge the gap between hearing and recreating professional-quality sounds, offering a combination of analysis and education not available in other products: a simple, intuitive way for new guitar players to analyze guitar tones and receive step-by-step instructions to replicate them.
Solution
Our team has created a simple, user-friendly web application that beginner guitar players can use to learn about the guitar effects used in their favorite songs. Users upload a .wav file of the guitar segment or full song they would like to learn about. If they are interested in a particular segment, they can use the cropping feature to analyze only that portion of the song. Once the upload is complete, users click the Classify button.
Under the hood, the .wav file is split into 10-second segments, each converted to a spectrogram and stacked into a single NumPy array. This array is sent to our custom SageMaker endpoint, which hosts our fine-tuned PANN model. Once the input is fed through the model, the endpoint returns its predictions, which are then processed, thresholded, and passed to ChatGPT to generate dynamic user feedback. Ultimately, the output gives the user the top three effects found in the submitted segment, a timeline of where those effects occur in the song, and further descriptive information about each effect, such as famous songs that use it and recommended effect pedals.
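The preprocessing step above (chunk the waveform into 10-second windows, turn each window into a spectrogram, stack the results into one array) can be sketched with plain NumPy. The sample rate, FFT size, and hop length here are illustrative assumptions, not the production values, and a sine tone stands in for a real guitar recording:

```python
import numpy as np

SAMPLE_RATE = 44_100       # assumed sample rate; the real pipeline may differ
WINDOW_SECONDS = 10        # matches the 10-second segments described above
N_FFT = 1024               # illustrative FFT size
HOP = 512                  # illustrative hop length

def chunk_audio(samples: np.ndarray, sr: int = SAMPLE_RATE,
                seconds: int = WINDOW_SECONDS) -> list:
    """Split a mono waveform into fixed-length windows, dropping the remainder."""
    win = sr * seconds
    return [samples[i:i + win] for i in range(0, len(samples) - win + 1, win)]

def spectrogram(window: np.ndarray, n_fft: int = N_FFT, hop: int = HOP) -> np.ndarray:
    """Log-magnitude STFT spectrogram (frequency bins x time frames)."""
    frames = [window[i:i + n_fft] for i in range(0, len(window) - n_fft + 1, hop)]
    stft = np.fft.rfft(np.asarray(frames) * np.hanning(n_fft), axis=1)
    return np.log1p(np.abs(stft)).T

# Toy input: 25 s of a 440 Hz tone standing in for an uploaded recording.
t = np.arange(25 * SAMPLE_RATE) / SAMPLE_RATE
audio = np.sin(2 * np.pi * 440.0 * t).astype(np.float32)

# One stacked array per upload, ready to send to the inference endpoint.
specs = np.stack([spectrogram(w) for w in chunk_audio(audio)])
print(specs.shape)  # → (2, 513, 860): 2 windows, 513 freq bins, 860 frames
```

In the real system this array would then be serialized and posted to the SageMaker endpoint; that call is omitted here since it depends on deployment details.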
Figure 1: ToneClone Architecture
Data Source
What makes ToneClone possible is the creation of a new, synthetic dataset of labeled guitar effects. We started with publicly available guitar arrangements and converted them to MIDI. These tracks were then processed through a high-quality virtual guitar instrument to generate realistic, clean guitar recordings. Next, we applied a wide range of digital effects to the clean recordings and labeled each resulting track for model training.
This dataset includes 100 songs, each processed with 45 different effect combinations, resulting in more than 400 hours of data.
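The core of the pipeline is rendering a clean track through an effect and recording which effect was applied as the label. The snippet below is a toy stand-in, assuming a single-tap delay implemented in NumPy rather than the virtual instruments and commercial effect plugins the real dataset used; the delay time and feedback values are arbitrary:

```python
import numpy as np

def apply_delay(clean: np.ndarray, sr: int, delay_ms: float = 350.0,
                feedback: float = 0.4) -> np.ndarray:
    """Toy single-tap delay: mix an attenuated, time-shifted copy into the signal."""
    d = int(sr * delay_ms / 1000)          # delay in samples
    out = clean.copy()
    out[d:] += feedback * clean[:-d]       # add the delayed copy
    return out

sr = 22_050
t = np.arange(sr) / sr                     # 1 s of audio
clean = np.sin(2 * np.pi * 196.0 * t)      # a G3 sine stands in for rendered MIDI

# Each dataset entry pairs processed audio with the list of effects applied.
labeled_example = (apply_delay(clean, sr), ["delay"])
```

Repeating this over 100 songs and 45 effect combinations, as described above, is what yields the 400+ hours of labeled training data.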
Figure 2: Example Spectrogram
Figure 3: Data Pipeline
Data Science Approach
Modeling Options
Custom CNN
Our initial approach involved three different modeling avenues. We first built a custom CNN and experimented with various combinations of layers, filter sizes, and weights. We evaluated the performance of these custom models on all of our effect combinations and fine-tuned them based on the lower-performing effects.
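The building blocks of such a CNN (convolution, ReLU, pooling applied to a spectrogram) can be illustrated in plain NumPy. This is a minimal sketch of one layer's operations, not our actual architecture; the kernel is random and the input is a stand-in for a spectrogram window:

```python
import numpy as np

def conv2d(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Valid 2-D cross-correlation, the core op of each convolutional layer."""
    kh, kw = kernel.shape
    h, w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x: np.ndarray) -> np.ndarray:
    """Zero out negative activations."""
    return np.maximum(x, 0.0)

def max_pool(x: np.ndarray, size: int = 2) -> np.ndarray:
    """Downsample by taking the max over non-overlapping size x size blocks."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

rng = np.random.default_rng(0)
spec = rng.standard_normal((64, 64))       # stand-in spectrogram window
feat = max_pool(relu(conv2d(spec, rng.standard_normal((3, 3)))))
print(feat.shape)  # → (31, 31)
```

A real model stacks many such layers with learned kernels and ends in a classifier head that scores each effect.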
Fine-Tuned Pre-Trained Audio Neural Network
Our second approach involved a pre-trained model called PANN (Pre-trained Audio Neural Network; Kong et al., 2019). This model was trained on ~5,800 hours of audio and uses a similar CNN architecture. We fine-tuned it for our specific use case and once again evaluated the results on our various effect combinations.
Evaluation
Both the custom CNN and the fine-tuned PANN produced high F1-scores, as shown in the results below. Ultimately, the fine-tuned PANN performed slightly better than our baseline and other models, so our team went forward with it for the MVP.
Model | Accuracy | F1-Score (weighted) | Precision | Recall
Random Chance (Baseline) | 0.0222 | 0.1092 | 0.1626 | 0.0844
Simple CNN | 0.6176 | 0.7824 | 0.8477 | 0.7934
Pre-Trained PANN w/ Classifier | 0.2684 | 0.5379 | 0.7701 | 0.4639
Best CNN | 0.8279 | 0.9168 | 0.9083 | 0.9345
Best Fine-Tuned PANN | 0.8729 | 0.9384 | 0.9238 | 0.9609
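The weighted F1-score reported here averages per-effect F1 weighted by each effect's support, which matters when effect frequencies are imbalanced. A minimal sketch over hypothetical multi-label segment predictions (the effect names and labels below are made up for illustration):

```python
from collections import Counter

def weighted_f1(y_true: list, y_pred: list) -> float:
    """Support-weighted F1 over multi-label effect predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        for e in set(pred) & set(truth): tp[e] += 1   # correctly detected
        for e in set(pred) - set(truth): fp[e] += 1   # falsely detected
        for e in set(truth) - set(pred): fn[e] += 1   # missed
    labels = set(tp) | set(fp) | set(fn)
    total = sum(tp[e] + fn[e] for e in labels)        # total support
    score = 0.0
    for e in labels:
        p = tp[e] / (tp[e] + fp[e]) if tp[e] + fp[e] else 0.0
        r = tp[e] / (tp[e] + fn[e]) if tp[e] + fn[e] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        score += f1 * (tp[e] + fn[e]) / total         # weight by support
    return score

# Hypothetical segment-level labels for illustration
truth = [["distortion", "delay"], ["reverb"], ["distortion"]]
pred  = [["distortion"], ["reverb", "delay"], ["distortion"]]
print(round(weighted_f1(truth, pred), 3))  # → 0.75
```

In practice a library routine such as scikit-learn's `f1_score` with `average="weighted"` computes the same quantity.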
The most comparable model in existing research was ResNet18 (Guo & McFee, 2023), with an F1-score of 0.9292. However, that model only predicts effects on a single note or chord, a much simpler task than ToneClone's analysis of full songs or song segments.
Key Learnings and Impact
Throughout this project, our team learned a great deal about the feasibility and effectiveness of ToneClone. Early on, we realized that even a simple CNN architecture performs well on the task at hand. By building on existing work such as the PANN model and other research into guitar-effect classification, we created a working prototype that moves past classifying single notes or small segments to analyzing full guitar songs. Troubleshooting our various models and integrating them into a web application took significant iteration. Eventually, we settled on a simple, intuitive interface that conveys complex information in a digestible way for beginner guitar players.
We also received a lot of enthusiasm from various SMEs and potential users about the product's potential. Ultimately, ToneClone strives to help beginner guitarists not only replicate tones but also develop their own. We want to provide the musical language and tools that let guitarists explore. The application doesn't promise the exact effect combination for every submitted song, but it offers guidance and knowledge toward that goal. Learning guitar can be a daunting task, especially when done alone; ToneClone strives to reduce that learning curve one song at a time.
Acknowledgements
Our team would like to thank Thom Planert and Blake Ricks for their feedback and insight during SME interviews. Their input was invaluable in the development of ToneClone and also provided us confidence that we were solving a real problem for guitarists. Big thanks to Joyce Shen and Zona Kostic as well for their guidance throughout the Capstone semester.
References
Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., & Plumbley, M. D. (2019). PANNs: Large-scale pretrained audio neural networks for audio pattern recognition. arXiv. https://arxiv.org/abs/1912.10211
Guo, J., & McFee, B. (2023). Automatic recognition of cascaded guitar effects. DAFx 2023. https://www.dafx.de/paper-archive/2023/DAFx23_paper_30.pdf