Generate Piano Music Using Deep-Learning
Part-2: Obtain music data using a simple web scraping technique in Python
Background
Previously we showed readers how to set up the development environment; let's continue by showing how to get music data online using web scraping. Training a deep learning model from scratch usually requires a lot of data. If you don't have it all handy, an intuitive idea is to get it online, and web scraping is one of Python's strong suits. Please note: do so cautiously. What we discuss here is for research purposes only; readers should always respect any relevant copyright involved when practising these techniques, and should never abuse a website when scraping it.
Music Data Format
There could be different formats available online for any given piece of music we are interested in, and we focus on "MIDI" for two main reasons:
- We will need to convert the music into "sheet notes" later on, so that we can employ Natural Language Processing (NLP)-like methods when training the deep learning model. A MIDI file carries all the key information we will need, such as the type of instrument, the beat and the notes, etc.
- A MIDI file is small, which is an advantage when we want to obtain and process many of them.
- Readers can certainly learn more about MIDI if interested: https://en.wikipedia.org/wiki/MIDI
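To get a feel for how structured the format is, here is a small stdlib-only sketch (not part of our scraping pipeline, purely illustrative) that builds and parses the 14-byte header chunk of a Standard MIDI File — the "MThd" chunk that every .mid file starts with:

```python
import struct

# A minimal Standard MIDI File header chunk: chunk id 'MThd', length 6,
# format 0 (single track), one track, 480 ticks per quarter note.
header = b'MThd' + struct.pack('>IHHH', 6, 0, 1, 480)

def parse_midi_header(data):
    """Parse the 14-byte header chunk of a Standard MIDI File."""
    chunk_id = data[:4]
    length, fmt, n_tracks, division = struct.unpack('>IHHH', data[4:14])
    return chunk_id, fmt, n_tracks, division

print(parse_midi_header(header))  # → (b'MThd', 0, 1, 480)
```

Everything after the header is compact binary track data (note on/off events, instrument changes, tempo), which is why MIDI files are typically only a few kilobytes.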
Web Scraping
The Aim
There are many good Python packages that can do web scraping, and producing a fully robust, well-automated program can require very seasoned developer skills. To lower the complexity, we will target one very specific website: http://www.piano-midi.de/beeth.htm, which kindly provides several pieces by Beethoven (and by other masters, if readers are interested). Instead of manually downloading these MIDI files one by one (clicking our mouse and keyboard many times), we aim to develop a little web scraping function which will download all the MIDI files from this web page and save them into a nominated folder within our project folder, automatically. Again, we strongly urge readers to follow and honour the copyright terms of the website: http://www.piano-midi.de/copy.htm.
Basic way of working
- We have created a project folder named “deep_piano” by now.
- Let's create a sub-folder named "utils" (under the project folder). In this "utils" folder, we shall create a new file named "__init__.py" and leave it empty (note the two consecutive underscores, before and after "init", in the name).
- The above setup enables a very convenient way to manage our project, which we will show below.
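The steps above can also be scripted. A minimal sketch (run from inside the project folder) that creates the "utils" package with its empty "__init__.py":

```python
import os

def make_utils_package(project_dir):
    """Create the 'utils' sub-folder with an empty __init__.py,
    which marks it as an importable Python package."""
    utils_dir = os.path.join(project_dir, 'utils')
    os.makedirs(utils_dir, exist_ok=True)
    # the file only needs to exist; its contents stay empty
    open(os.path.join(utils_dir, '__init__.py'), 'w').close()
    return utils_dir

make_utils_package('.')
```

The presence of "__init__.py" is what later lets us write imports like "from utils.web_scrape import get_midi".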
The Code
1. First we will create a Python file named “web_scrape.py” under the “utils” folder. Within this file, we will define the below “get_midi” function:
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import requests, os, progressbar


def get_midi(target_url, midi_dir):
    '''
    target_url: the web page we want to get the MIDI from
    midi_dir  : the folder where we will save the MIDI into
    '''
    # if the output dir does not exist, create one
    if not os.path.isdir(midi_dir):
        os.mkdir(midi_dir)
    # try web scraping
    try:
        midi_count = 0
        # fetch the target web page and parse it with BeautifulSoup
        page = requests.get(target_url)
        data = page.text
        soup = BeautifulSoup(data, 'html.parser')
        # print out progress using ProgressBar
        with progressbar.ProgressBar(max_value=progressbar.UnknownLength) as bar:
            # loop through all the links within the page and see which is a MIDI file
            for link in soup.find_all('a'):
                current_link = link.get('href')
                # if the link points to a MIDI file, download it to the local folder
                if current_link is not None and current_link.lower().endswith(('.mid', '.midi')):
                    # get the absolute url of the midi file
                    if not current_link.startswith('http'):
                        sub_url = urljoin(target_url, current_link)
                    else:
                        sub_url = current_link
                    # download the midi file
                    r = requests.get(sub_url, allow_redirects=True)
                    midi_name = sub_url.split('/')[-1]
                    with open(os.path.join(midi_dir, midi_name), 'wb') as f:
                        f.write(r.content)
                    # count and update the progress bar
                    midi_count += 1
                    bar.update(midi_count)
        print('In total {} midi files downloaded'.format(midi_count))
    # if web scraping failed, give the user a warning and return
    except Exception as e:
        print(e)
        print('web scraping of {} failed :('.format(target_url))
We have put sufficient comments within the above code, which should be self-explanatory enough for readers to follow it step by step. Some minor call-outs:
- This code is only tested against the specific website we are targeting, so we would not be too surprised if it failed when run directly against another website. However, it should provide an example "framework" that readers can make more robust for any other website they want to try. Some Python coding knowledge is required, of course, but with the help of Google that should not be too hard.
- We only search the page itself, assuming the midi files are linked from it directly. If the midi files sit behind child links of the page, the current code will not find them. To handle that, readers would need to extend the code to crawl the website recursively (deeper and wider), which is beyond the scope of our discussion here.
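The link-filtering logic at the heart of "get_midi" can be exercised in isolation when adapting it to another site. A minimal sketch (the HTML snippet below is made up for illustration):

```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def find_midi_links(html, base_url):
    """Return absolute URLs of all links on the page that end in .mid/.midi."""
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    for a in soup.find_all('a'):
        href = a.get('href')
        if href is not None and href.lower().endswith(('.mid', '.midi')):
            # resolve relative links against the page's own URL
            links.append(urljoin(base_url, href))
    return links

sample = '<a href="midis/op27.mid">Sonata</a> <a href="copy.htm">Copyright</a>'
print(find_midi_links(sample, 'http://www.piano-midi.de/beeth.htm'))
# → ['http://www.piano-midi.de/midis/op27.mid']
```

Separating "find the links" from "download the files" like this also makes the code easier to unit test before pointing it at a live website.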
2. With the "get_midi" function ready, we can now create the "main.py" program file under the project folder "deep_piano", with the following code to actually execute the download:
from utils.web_scrape import get_midi
import os

target_url = 'http://www.piano-midi.de/beeth.htm'
midi_dir = os.path.join(os.getcwd(), 'midi_download')
get_midi(target_url, midi_dir)
When we execute the above code (either from a Jupyter Notebook, or from the command prompt as "python main.py"), the program downloads 58 midi files into the folder "midi_download" we nominated. The exact number may vary if the website has been updated by the time readers try it.
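As a quick sanity check, we can count what actually landed on disk. A small sketch (it assumes the same "midi_download" folder as above):

```python
import os

def count_midi_files(midi_dir):
    """Count files in the download folder that look like MIDI files."""
    return sum(1 for f in os.listdir(midi_dir)
               if f.lower().endswith(('.mid', '.midi')))

midi_dir = os.path.join(os.getcwd(), 'midi_download')
if os.path.isdir(midi_dir):
    print('{} midi files found'.format(count_midi_files(midi_dir)))
```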
The organisation of our project folder will look like something below:
.deep_piano
├── main.py
├── utils
│ ├── __init__.py
│ └── web_scrape.py
├── midi_download
│ ├── midi_file_xxx.midi
And because of the way we set up the "utils" folder earlier (remember the "__init__.py" file), we can import the "get_midi" function as simply as "from utils.web_scrape import get_midi".
There is certainly room to improve the code. For example, it currently downloads the midi files one by one, which could be optimised by refactoring the code to allow parallel downloading.
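One way to sketch that refactor is with a thread pool from the standard library. In this hedged example, "fetch" stands in for the per-file download logic (in practice it would wrap "requests.get" and the file write):

```python
from concurrent.futures import ThreadPoolExecutor

def download_parallel(urls, fetch, max_workers=4):
    """Apply `fetch` to every URL concurrently and return the
    results in the same order as the input list."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```

Threads suit this job because downloading is I/O-bound: while one thread waits on the network, the others keep working. Be considerate with "max_workers", though, so as not to hammer the website.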
Next time we will discuss how to prepare the midi files for deep learning.