Project: Language Syntax Comparison - XML vs Database
When people are approaching a new problem, they tend to want to solve it using the tools they are most familiar with. In this case, I can recall using XML as a manner of sending data and configurations in a number of projects I've developed in the past. It is something I've worked with before and the text document like data I'm working with seems like it would fit well. Once I started thinking about scalability however, the limitations of an xml based plan became apparent.
The original plan of attack for this project was to take the existing format from the original web page and turn it into an xml. Then in the future, I could set any two languages' xml next to each other in a table and compare them. If I wanted to add new languages in the future, I would only have to create the data once before being able to generate web pages that compares it with all of the other languages. If I had 10 different languages in 10 xmls, I would be able to generate up to 90 different pages (Ignoring duplicates based on ordering).
Since the data is in a hierarchical form that is always displayed in the same order, it seems like this would match up nicely to an xml format. The problem is not all languages contain all of the same functionality, and there are plenty of languages I am not familiar with, or which haven't even been invented yet that may have new functionality which doesn't exist in my current schema.
So while xml is great at defining the existing data, it would be awkward later on when I need to change or add new data.
After consulting with a friend, I've changed my approach and plan to use a database now. Creating a database for this purpose will take a little more work at the start, but will hopefully make things a lot easier later on.If I design the tables right, I will be able to add new sections by simply adding a new row, without ever needing to touch the code or the existing languages.
I spent a bit of time figuring out just which kind of queries the web page was going to need, and how the tables should be structured in the database. One of the things I became concerned about however, was the performance hit based on the number of database queries I would have to make to generate the page. Depending on how complex of a join I could return, creating the page could take as many as twenty some queries to generate.
So because of that, I am now considering the idea of caching the pages. The site should only ever need to generate a new page if the database rows for a language have been updated or added since the last time the page was created. This does mean that if I have 10 languages, I will need up to 90 different pages cached on my site. The size of an html page should be relatively small though. These kind of website management and performance issues aren't thing's I've had to deal with personally before so it will take a little time to figure out what the standard efficient ways are for implementing this; but it seems like a smart approach.
With all that in mind, my current database table plan is looking like this.
- Computer_Language (Lang, ID, last_modified)
- Cached_pages (Lang1, Lang2, Page_Name, last_modified)
- Major_categories (Category, Order_ID)
- Major_Category_language (Lang, Category, exist?, comment_text)
- Sub_Section (Major_Category, Sub_section, Subsection_Order_ID)
- Code_Blocks (Lang, Major_cat, Subsection, exist?, formatted_html, last_modified)
The basic structure is there, all that is left is fixing the naming conventions and setting up data types and primary/foreign keys. The one thing I'm still questioning is the matter of the "exist?" column and the Major_Catagory_language tables. The issue at hand is a functionality question about what I do when I cam comparing two languages where either one or both of them don't have syntax for a specific section.
If both languages don't have support for functionality from a major category, such as threading, then there is no point in including the threading section on the page at all. However, if there is a case where one language does have a specific subsection while the other one doesn't, that is a distinction I still want to make. For instance, on the existing syntax comparison web page where it compares C# to VB.NET, the Classes / Interfaces category has a section that defined VB.NET's use of the key word Module; on the C# side it is commented "No Module equivalent - just use static class".
Basically, each language needs to provide a way to represent that a specific subsection has no equivalent in the language, while still providing a text/comment for that particular row in the database in case there is an exception or workaround that needs to be pointed out.
The other issue is that, in the future when I add a new section to one language, until I go back and update all the other languages those rows won't exist and/or will be NULL until data is provided. That probably has to be handled on the application side, though it would be nice if I can ensure that there is always a row for every sub catagory / language combination.
Design documentation - Data Entry
I spent a good part of my work last week taking apart the language syntax format and trying to figure out just how many rows I wanted to divide it up into. Editing the existing html by hand, I came up with a baseline sheet which has all of the separate code blocks in their own table cells, with the names of the cell's I want to use on the right hand side. This will act as the model for the database content once the tables are created.
There were 128 different sub sections that require content. That means every single language will have 128 different rows containing the text and code blocks used to compare the syntax. There were also 26 major categories (such as Program structure, Data Types, Functions, ect) that will act as the headers which will break up those 128 different cells.
At one point this number was closer to 200; but at a certain point it did not make sense to break down each individual sub section. There are already sections that require being written in a 'hello world' or 'student gpa' type structure in order to consistently demonstrate the context syntax across languages. In which case, it is not to much to ask that data types or arithmetic operators be inserted in a specific order each time, rather then creating a separate row for every single one. Those sections will end up a lot cleaner and more condensed as a result of a little more demand on the administrative input.
I also ended up creating a 20 page design document, which is admittedly a little more formal then functional. Still it helps to have all of the design requirements down in one place, and will act as detailed documentation in the future. The design doc proved useful when consulting with other people about possible design implementations to use.
Regroup - Future plans
Using my current design template I should be able to get some preliminary data into the database from one of the languages to act as a template. This takes care of the 'schema design' part of the project, but there is still a lot to do:
- Teach myself php. Focusing on database connections and dynamic html creation.
- Program a page that can generate data from the database into an html table
- Program some kind of input application form to allow adding or editing languages in the database
- Set up the css, html header/footer, and other persistent parts of the web page
- Deal with optimization and performance issues for both the DB and web page.
- Figure out how to set up and execute proper page caching on my site
- Populate the database with at least 5 languages, starting with the existing 3 from the original site.
- Clean up the page, make it look good, test for bugs. (Test adding new sections, categories, languages)
I have a dilemma.
On one hand, I could treat these projects as if I was under a professional deadline, meaning I would be much more likely to put out rapid prototypes to get results out there; but I would have to learn from my mistakes and potentially have some bad designs built into my work because I didn't know any better at the time. Makes for a better blog post and it is good practice for real development times, but results in shoddier work and bad habits.
On the other hand, since there isn't a deadline I can take the time to learn the content and program the project right from the stat. That way I'd be better able to do the work quickly AND accurately in the future when I do have a deadline. The problem with that method is the danger of letting my deadlines constantly slip and never producing anything in a timely fashion.
I ended up spending two and a half days last week learning from a deep MIT database course. While it has helped me get a much better conceptual understanding of the systems I am working with, none of that study led to any progress on the project. Even though I have experience working with complex database systems before, I'd never had to understand deep administrative concerns for creating one. It might be overkill for the scale of my project now, but I want to make sure I do understand the low level fundamentals when I do encounter the more complex problems. Plus it makes for better coding and design habits.
For now, I'll stick to my schedule of a project blog sometime in the first half of the week, and an op-ed article in the last half. I'll just have to keep that balance of academics vs production in mind for the future.