The IronPython project is looking at moving completely to GitHub for all project information: downloads, issues, wiki, etc. The main problem is that IronPython currently resides on CodePlex, and CodePlex, sadly, does not provide an API for accessing anything. This means we need to use screen scraping to get the job done on the CodePlex side. On the GitHub side, they have a wonderful API that is very well documented and has libraries for many languages. BeautifulSoup is a library I have previously used for screen scraping from Python, and it was a great experience; it's a simple library to use.
Some goals for the script based on feedback from the project:
- Maintain history (comments) as much as possible
- Maintain component notations
- Maintain releases
- Migrate both open and closed issues
- Migrate attachments if possible
When doing screen scraping, I really like to use the developer tools from whatever browser I am using (usually Chrome) to make viewing the source and finding patterns in the HTML easier. I decided to scrape the information I needed in a couple of different steps. I could get some of the information from the list of issues, but then I would also need to go to each individual issue page and scrape information from there.
I decided to use the Advanced view for the bug tracker on CodePlex because it had a lot of information that I could pull out right from the get go.
[caption id="attachment_143" align="aligncenter" width="755"]IronPython advanced view for issues.[/caption]
You can see that we can get information like ID, Title, Status, Type, Priority and last update (though the last update wasn't really useful). It was also possible to grab the link for the specific issue for use later.
One thing I did when writing the script was to set up the filters and sorting the way I wanted before grabbing the soup; then I used the direct link available on the page to fetch the issues in the order I really wanted.
As you can see from the screenshot below, if you use the "Inspect Element" in Chrome it will show you the structure for each row in the list of issues.
[caption id="attachment_145" align="aligncenter" width="912"]Row information from advanced view.[/caption]
Each row of the advanced view has several pieces that we can pull out, and each row's id starts with "row_checkbox_", which makes it very easy to loop through the rows using BeautifulSoup.
[gist id=4403233 file=codeplex_row_checkbox.py]
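A minimal sketch of that loop is below. The HTML fragment and column layout are invented stand-ins for the real advanced view, and the post uses "html5lib" for this page; "html.parser" keeps the sketch dependency-free.

```python
import re
from bs4 import BeautifulSoup

# A tiny stand-in for the advanced-view HTML; the real page has many more
# columns, but every issue row's id starts with "row_checkbox_".
html = """
<table>
  <tr id="row_checkbox_1"><td>101</td><td>Fix parser</td><td>Active</td></tr>
  <tr id="row_checkbox_2"><td>102</td><td>Add docs</td><td>Closed</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# One regex on the id attribute finds every issue row in the table.
rows = soup.find_all("tr", id=re.compile(r"^row_checkbox_"))
issues = [[td.get_text(strip=True) for td in row.find_all("td")] for row in rows]
print(issues)  # [['101', 'Fix parser', 'Active'], ['102', 'Add docs', 'Closed']]
```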
Each row can have information about who the issue is assigned to and whether it's currently closed, as well as a link to the individual issue page that we will need later. I grabbed all this info and put it into a SQLite database so that I could update each record once I parsed the individual issue page.
[gist id=4403233 file=codeplex_row_info.py]
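The two-pass idea — insert a row per scraped issue, then update it from the individual issue page — can be sketched like this. The schema and column names here are my assumptions, not the script's actual tables.

```python
import sqlite3

# Hypothetical holding schema: one row per CodePlex issue.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE issues (
        codeplex_id INTEGER PRIMARY KEY,
        title       TEXT,
        status      TEXT,
        assigned_to TEXT,
        url         TEXT,  -- link to the individual issue page, parsed later
        description TEXT   -- filled in on the second pass
    )
""")

# First pass: one INSERT per row scraped from the advanced view.
conn.execute(
    "INSERT INTO issues (codeplex_id, title, status, assigned_to, url) "
    "VALUES (?, ?, ?, ?, ?)",
    (101, "Fix parser", "Active", "someone",
     "https://ironpython.codeplex.com/workitem/101"),
)

# Second pass: UPDATE the same row once the issue page has been parsed.
conn.execute("UPDATE issues SET description = ? WHERE codeplex_id = ?",
             ("Parser fails on nested lambdas...", 101))
row = conn.execute("SELECT title, description FROM issues").fetchone()
print(row)  # ('Fix parser', 'Parser fails on nested lambdas...')
```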
GitHub treats the severity and type of an issue as labels, so I added the severity and type to the issue_to_label table with a foreign key into the issues table; this made it easier later to add all the necessary labels. CodePlex will only show up to 100 items per page, so I regenerated the direct link with the page number I wanted and parsed each page to get all the issues.
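Regenerating the direct link per page is just string building; the query-string keys below are a hypothetical reconstruction, not CodePlex's actual parameters.

```python
from urllib.parse import urlencode

# Hypothetical base for the advanced-view "direct link".
BASE = "https://ironpython.codeplex.com/workitem/list/advanced"

def page_url(page, size=100):
    # CodePlex caps a page at 100 items, so the scraper walks page by page
    # until a request comes back with fewer than `size` rows.
    return BASE + "?" + urlencode({"size": size, "page": page})

print(page_url(0))
# https://ironpython.codeplex.com/workitem/list/advanced?size=100&page=0
```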
Now that I have all the issues in a database, I select them all and iterate through them to parse the individual issue pages to grab all the information.
[gist id=4403233 file=codeplex_iterate_issues.py]
One thing to note here is that I used different HTML parsers with BeautifulSoup in different parts of the script. When parsing the advanced view I used "html5lib", but while parsing the individual issues I used "html.parser". The reason is that each parser treats incomplete tags differently: one adds extra tags to make up for missing ones, the other does not. The HTML generated by CodePlex had some weirdness around the issue descriptions, so using "html.parser" cleared some of those issues up and made the soup easier to work with.
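A quick illustration of the difference on a fragment with an unclosed tag (the fragment is made up, but the behavior is real):

```python
from bs4 import BeautifulSoup

# An unclosed <b>, similar in spirit to the weirdness in CodePlex descriptions.
broken = "<div>hello <b>world</div>"

# "html.parser" repairs minimally: it closes the dangling <b> at </div> but
# does not wrap the fragment in <html>/<head>/<body> the way html5lib does,
# which keeps the soup closer to the original fragment.
soup = BeautifulSoup(broken, "html.parser")
print(soup.div.get_text())  # hello world
```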
While parsing each issue, there were four main areas that I wanted to get information from:
[caption id="attachment_152" align="aligncenter" width="758"]Areas of interest[/caption]
The description was pretty straightforward: I looked at the HTML for that area and found the following:
[caption id="attachment_153" align="aligncenter" width="885"]HTML for issue description area.[/caption]
This was pretty easy to grab from the soup, but the description content could contain markup (bold, italics, etc.). So I decided to use the html2text module to convert the description into valid Markdown that could be used directly on GitHub.
[gist id=4403233 file=codeplex_description_html2text.py]
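The real script leans on html2text for this; as a dependency-free illustration of the HTML-to-Markdown idea, here is a toy converter built on the stdlib HTMLParser that handles only bold and italics — html2text covers far more (links, lists, headings, and so on).

```python
from html.parser import HTMLParser

# Toy HTML-to-Markdown converter: only <b>/<strong> and <i>/<em> are handled,
# purely to show the shape of what html2text does much more thoroughly.
class MiniMarkdown(HTMLParser):
    MARKS = {"b": "**", "strong": "**", "i": "*", "em": "*"}

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        self.out.append(self.MARKS.get(tag, ""))

    def handle_endtag(self, tag):
        self.out.append(self.MARKS.get(tag, ""))

    def handle_data(self, data):
        self.out.append(data)

def to_markdown(html):
    parser = MiniMarkdown()
    parser.feed(html)
    return "".join(parser.out)

print(to_markdown("Fails on <b>IronPython 2.7</b> with <i>weird</i> output"))
# Fails on **IronPython 2.7** with *weird* output
```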
The attachments were also pretty easy, each one had a specific id that could be pulled out using BeautifulSoup:
[gist id=4403233 file=codeplex_issue_attachments.py]
Comments were a little trickier; they had several bits of information I was interested in. I wouldn't be able to preserve the original commenter's identity on GitHub, but I wanted to keep who made each comment and when, and add these items as comments on the GitHub issues in the order they were made on CodePlex.
[gist id=4403233 file=codeplex_issue_comments.py]
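The extraction pattern looks roughly like this. The class names (`comment`, `author`, `date`, `body`) and the markup are hypothetical stand-ins for CodePlex's actual structure; only the approach — find each comment block, pull out author, date, and body in page order — matches the description above.

```python
from bs4 import BeautifulSoup

# Hypothetical comment markup; the real CodePlex class names differed.
html = """
<div class="comment"><span class="author">alice</span>
  <span class="date">Jan 3, 2012</span><div class="body">Repro confirmed.</div></div>
<div class="comment"><span class="author">bob</span>
  <span class="date">Feb 9, 2012</span><div class="body">Fixed in trunk.</div></div>
"""

soup = BeautifulSoup(html, "html.parser")
comments = [
    {
        "author": c.find("span", class_="author").get_text(strip=True),
        "date": c.find("span", class_="date").get_text(strip=True),
        "body": c.find("div", class_="body").get_text(strip=True),
    }
    for c in soup.find_all("div", class_="comment")
]

# Prefix each GitHub comment with the original author and date, since the
# importing account (not the original commenter) will own it on GitHub.
for c in comments:
    print(f"**[{c['author']} on {c['date']}]** {c['body']}")
```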
As you can see, with an understanding of how the HTML is put together, it is pretty easy to pull out the information you are interested in; even though CodePlex doesn't have an API, it does put a lot of information into the HTML of the issues that can be parsed out.
The metadata area was also fairly well structured. It is just a table contained within a div with the id "right_side_table," and looping through the tr elements and pulling out the info is a piece of cake again.
[gist id=4403233 file=codeplex_issues_metadata.py]
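That loop can be sketched as follows. The `right_side_table` id comes from the post; the row contents and the key/value table layout are my assumptions for the sketch.

```python
from bs4 import BeautifulSoup

# The metadata lives in a table inside <div id="right_side_table">;
# the rows here are invented examples.
html = """
<div id="right_side_table"><table>
  <tr><td>Component</td><td>Runtime</td></tr>
  <tr><td>Release</td><td>2.7.1</td></tr>
</table></div>
"""

soup = BeautifulSoup(html, "html.parser")
metadata = {}
# Each tr is a key/value pair; walk them all and build a dict.
for tr in soup.find("div", id="right_side_table").find_all("tr"):
    key, value = (td.get_text(strip=True) for td in tr.find_all("td"))
    metadata[key] = value
print(metadata)  # {'Component': 'Runtime', 'Release': '2.7.1'}
```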
Some of the metadata was used to update fields for the issue itself, but the rest were added to the description under a header "Work Item Details" to maintain the history of the information when the issue was moved from CodePlex to GitHub.
Once all the data was in the database, it was pretty easy to import into GitHub using the PyGithub module. The one bad thing about this module is the lack of good documentation; I had to figure a few things out by reading the source code, as well as the GitHub API documentation, to see what was possible with the different API calls.
Since the GitHub part is really easy to comprehend and the majority of this article is about screen scraping, I will just provide the code for the script in the gist below.
The end result of the imported issues list can be seen below, from a practice run on GitHub.
[caption id="attachment_158" align="aligncenter" width="725"]Issues after being imported to GitHub[/caption]
The severity (high, medium, etc.), the type (task, feature, etc.) and the component are all turned into labels with nice color coding on some of them.
The script migrates any plaintext attachments over as Gists and then puts a link to the Gist in the description area. Binary attachments are left on CodePlex and linked to directly. It would be better to have everything in one place, but GitHub doesn't really have a good way of attaching binary items to tickets (or any attachments at all in fact).