How should database models be persisted?

The conventional way to persist the semantic data in a graphical database design/modeling tool is as a single binary file. This is easy to implement, but it does not mesh well with the text-oriented tools developers typically use (e.g. editors, source code control systems, file diff and file searching utilities). In addition, if something goes wrong, recovering your data is difficult. Even if the file is persisted as XML, it requires users to understand and work with a new file format. I think a better way to persist semantic information is as a small number of T-SQL creation scripts. Information that has no T-SQL equivalent, like graphical layout, should be stored in a single XML file. This allows users to edit, search, and compare versions of the database model using standard tools. Most importantly, users can easily resolve change conflicts with standard tools because the model is stored in a natural and familiar form (i.e. T-SQL).
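To make the split concrete, here is a minimal sketch of what a tool following this approach might do. It is written in Python purely for illustration; the save_model helper, the table and layout structures, and the file names are all made up and are not the object model of any actual tool.

    # Minimal sketch of the proposed split: semantic content as an ordinary
    # T-SQL creation script, layout-only information in one XML file.
    import xml.etree.ElementTree as ET

    def save_model(tables, layout, sql_path="Tables.sql", layout_path="Layout.xml"):
        # Semantic information: plain T-SQL that editors, diff tools and
        # source code control systems can all work with directly.
        with open(sql_path, "w") as f:
            for table in tables:
                columns = ",\n    ".join(f"[{n}] {t}" for n, t in table["columns"])
                f.write(f"CREATE TABLE [dbo].[{table['name']}] (\n    {columns}\n);\nGO\n\n")

        # Graphical layout has no T-SQL equivalent, so it goes into one XML file.
        root = ET.Element("diagram")
        for shape, (x, y) in layout.items():
            ET.SubElement(root, "shape", name=shape, x=str(x), y=str(y))
        ET.ElementTree(root).write(layout_path)

    save_model(
        tables=[{"name": "Customer",
                 "columns": [("CustomerId", "int NOT NULL"), ("Name", "nvarchar(100)")]}],
        layout={"Customer": (120, 80)})

The point of the split is that everything a developer would want to diff, merge, or search lives in the T-SQL file, while the XML file only carries presentation details that are cheap to regenerate or discard.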

This approach worked well in the design of the new Visual Studio 2005 class designer. The main issues we encountered in the design were the cost of parsing a large amount of source at startup and the problem of what to do with class shapes that no longer mapped to classes in the source.

We solved the problem of parsing large amounts of source through careful design and performance engineering, as well as by developing an infrastructure that delayed loading data until it was actually required. Delay loading is a common technique in distributed systems but unusual in modeling tools.
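As a rough illustration of the delay-loading idea (a toy Python sketch, not the class designer's actual infrastructure), a model element can record where its source lives and only pay the parsing cost the first time someone asks for the parsed result:

    # Toy sketch of delay loading: nothing is parsed at startup; the parse
    # happens on first access and the result is cached for later calls.
    class LazyClassModel:
        def __init__(self, source_path):
            self.source_path = source_path
            self._members = None               # not parsed yet

        @property
        def members(self):
            if self._members is None:          # first access pays the parsing cost
                self._members = self._parse(self.source_path)
            return self._members

        def _parse(self, path):
            # Stand-in for a real language parser.
            with open(path) as f:
                return [line.strip() for line in f if line.strip()]

Opening a diagram with many such elements is then cheap, because the parsing cost is only incurred for the elements the user actually looks at.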

If users change, say, a class name without the class designer open, then, when the designer is next opened, the corresponding class shape needs to be mapped to the new class. Because classes do not change much, we can usually match the class shape to the renamed class. Because several classes may be changed at once, we developed an efficient algorithm that makes the best overall choice. However, sometimes there is no matching class. In those cases, we show the class shape in a special way, and the user can either manually match the shape to the correct class or just delete the shape. The approach is not perfect, but the pragmatics of how people develop software mean that it works almost all of the time without affecting existing development processes.
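The post does not describe the actual matching algorithm, but as a rough sketch of making a "best overall" choice across several changes at once, here is a greedy Python illustration using simple name similarity over made-up shape and class names:

    # Toy sketch: score every (shape, class) pair by name similarity, then
    # take the best pairs first so one shape cannot grab a class that is a
    # much better match for another shape. Unmatched shapes are left for the
    # user to resolve or delete.
    from difflib import SequenceMatcher

    def match_shapes_to_classes(shape_names, class_names):
        pairs = [(SequenceMatcher(None, s, c).ratio(), s, c)
                 for s in shape_names for c in class_names]
        matches, used_shapes, used_classes = {}, set(), set()
        for score, shape, cls in sorted(pairs, reverse=True):
            if score < 0.6 or shape in used_shapes or cls in used_classes:
                continue
            matches[shape] = cls
            used_shapes.add(shape)
            used_classes.add(cls)
        return matches

    print(match_shapes_to_classes(["CustomerOrder", "Invoice"],
                                  ["CustomerOrders", "InvoiceLine", "Shipment"]))
    # {'CustomerOrder': 'CustomerOrders', 'Invoice': 'InvoiceLine'}

A real implementation would weigh more than names (members, namespaces, file location), but the overall shape of the problem is the same: pick the assignment that is best across all changed classes, not just the first plausible match for each shape.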

Weighing against this view is the fact that many database modeling customers think of T-SQL creation scripts as something a tool generates, not as the fundamental objects the tool manipulates. Even for those customers, I think the ability to have several people working on the model at once, together with a lower risk of data loss, will be a compelling advantage.

Given the desire to persist database information as natural T-SQL scripts, how should the T-SQL scripts be structured? There seem to be two basic ways that T-SQL developers factor their scripts: as a few large files containing scripts for related objects, or as many small files containing a script for each individual object. I prefer the approach of files containing several related objects, but plenty of smart people favor the second approach. I know of one team at Microsoft where even the primary key creation script for a table gets its own file. However, I have never heard a good explanation of why some prefer many small files.

My theory is that the small files approach is a historical artifact of having to work with primitive, or nonexistent, source code control systems that did not allow simultaneous changes to a single file. Without modern tools, the cost of integrating changes would be large enough that many small files would make a lot of sense. With modern tools, the benefits of easy navigation between related objects outweigh the cost of integrating changes.

From a tools perspective, it is generally best to generate only a small number of files, since the cognitive burden on the user will be lower. If users mostly edit database creation scripts through a graphical tool, then a single file, which reduces file management issues, becomes even more attractive.

I know I am probably missing something, so I would appreciate hearing from people about why they prefer one approach or the other. In particular, I would like to hear why people prefer the many-small-files approach.

 

This posting is provided "AS IS" with no warranties, and confers no rights.

Comments

  • Anonymous
    September 21, 2005
    Hi Anthony,

    I believe there is a lot to be gained from using XML as the persistence format for database models. An XML model is perfect for post-processing to generate documentation and using as the input to code generation tools, etc. I would not like to even consider using raw T-SQL as the input to code gen!

    As you described above, a T-SQL model would need an additional file (XML) to hold designer information that couldn't be stored as T-SQL. Why create two files when an XML file could capture everything?

    The other downside I see to using T-SQL is that it ties the designer to SQL Server. This may of course be intentional, but it would prevent models from using constructs that cannot be captured in T-SQL.
  • Anonymous
    September 22, 2005
    Thanks for your feedback, Lee. Using T-SQL creation scripts as a persistence mechanism does not prevent you from exposing a rich object model. Nor does it prevent you from generating XML reports, database change scripts, etc.

    The nice thing about the DSL framework that the class designer is built on is that you can easily link DSLs together, via their object models, to do interesting things. For example, you could link a database model to a class diagram that represents the LINQ classes used to access the database. As you update the table definitions, the classes could be updated automatically.

    This posting is provided "AS IS" with no warranties, and confers no rights.
