The eye sees only what the mind is prepared to comprehend

Friday, August 24, 2007

Code depreciation

Depreciation is a term used in accounting, economics and finance with reference to the fact that assets with finite lives lose value over time (from Wikipedia), I this case, I'd like to discuss code depreciation and its relationship with optimization and mantainability by proposing the following rule:

If code is written for performance (compromising maintainability), the value of such performance optimization (and maintenance degradation) will depreciate, because the likehood of having either faster hardware or different developer maintaining the code increases as time passes.

On the other hand, maintaintable code, increases its value for similar reasons:

If code is written for maintainability (postponing performance enhancements), the value of that maintainability (and performace degradation) will increase its value, because the likehood of having either faster hardware or different developer maintaining the code increases as time passes.

I added this definitions to C2... I wonder how (or if they) will evolve.

Wednesday, August 22, 2007

Data Transfer Object Injection

Data Transfer Object injection is a programming error which results in security holes., it is to a Remote Object Service based applications which use object graphs what SQL Injection is to web-based applications which use databases.

DTO injection could happen where there is a remote object service that allows a client system to send and and object graph that is automatically converted by an object relational mapper in to SQL statements.

Instead of sending a valid object graph, the attacker can send a different object graph, representing alterations to the database that go well beyond his security level. For example, a remote object service receives an object graph that represent changes in the objects that represent new users, or new permissions granted for existing users of the system.

To prevent this problem it should be possible to specify at the object relational mapping level, which entities can be saved by the current user... many object relational mappers, or xml relational mappers automatically write the changes represented by the object graph to the database, without caring if the current application user has the privileges required to persist those objects... we can not rely on RDBMS security, because most remote object services use the same user for all the calls... and I think it that connecting with a different user for each remote object service would be bad for connection pooling (decreasing performance)

I wonder if anyone else thinks this is a common security problem... Mmmm... I will add this to C2... I wonder how (or if it) will evolve.

Saturday, August 11, 2007

REST DataService... it was so... obvious?

Hi!
When I built my first systems using .NET 1.0 (back in the year 2002), I was excited with the idea of using "XML SOAP WebServices" to communicate my client side application with my remote side business logic... but, after I started developing it, I realized I had to do a lot of stuff that just seemed repetitive and hard to use, why did I have to create a WebMethod for each of the CRUD operations... and many for the "R" in CRUD for each possible "select" variation... and it was even more problematic because sometimes I just had to have a "dynamic querying UI" and couldn't find a good way to send the definition of a query in a good "XML" way...
Then I realized... why should I create a method for each variant? why not just have a single web method:

DataSet executeQuery(string Query)

And I started changing all my code, anything that I wanted to obtain from the server could be obtained that way... (but... I started wondering... is that the one and only true way to use data oriented web services? I remember reading somewhere that wasn't such a good idea.. that SOA wasn't invented for that.. after all, that was just a thin XML wrapper over my ADO.NET data provider...)

Fast forward a few years.... a lot of people start talking about a doctoral dissertation written by Roy Fielding... and the reach the following conclusion "SOAP is just too complex" ,"Having to create a different web method for each action makes the interface complex and not that good for inter operation", "one needs to know too much to understand a SOAP web service because methods are not standard", "WSDL is too complex", "SOAP is going against the resource naming philosophy in HTTP", etc, etc.... And REST is the answer to all our problems...

Well here I am taking a look to the experimental REST Framework "Astoria" Microsoft is creating, and... it looks painfully similar to DataSet executeQuery(string Query), but it has a difference... it is not using SQL... it is using a custom ad hoc querying mechanism... that... does the same things SQL does?.. Perhaps it will implement better some relational features that are badly designer in SQL but... what is the real advantage here? what is the real difference between this and a SOAP web services that receives SQL and returns an XML representing rows in a database?

Is there really a difference? or it just that we (as an industry) needed to invent SOAP webservices to realize all we needed was a thin XML wrapper around SQL?

Update: Astoria is now known as WCF DataServices

Saturday, July 28, 2007

Why is validation so hard?

Here I am, trying to validate my persistent business objects before committing a transaction... since I am using Hibernate... that means they are POJOs...

Hibernate has its validation framework, that allows for validation using Java 5 @Annotations... it is a nice idea... but I don't feel that comfortable validating that way... Annotation based validation is fine for simple validation (not null, min/max size,etc) but is not that good for more complex stuff (validations formulas, stuff like "you can't buy that unless you have money in you account" or "a car has to have 4 wheels or it can not change is status to 'ready to run'").

The problem IMHO with the Hibernate Validator, is that it is triggered on "PreInsert" or "PreUpdate"... and those events are triggered each time a "Flush" is called (automatically or manually) but Flush is called with 2 different purposes, if is called explicitly, it often means "put this in to the database", and, when called automatically often means "put changes in to the database so that I can make queries without risk of inconsistencies" but it doesn't mean "the transaction is committed" (although a lot of people use it with that intention)... now... what if I want to validate only "just before when the transaction is committed", not "on flush"... (that can happen if I want to do complex validation that requires querying the database about its state, taking in consideration the modifications that my uncommitted POJOs will produce when flushed in to the database).

I believe that the main problem of the POJO nature of Hibernate persistence, is that POJOs do not know they are persistent, and therefore do not know that they need to be leave the database in a valid state after being flushed into it... I think Hibernate is missing a mechanism that can be called that does a kind of "fake commit" that applies all the changes to the database, then call a validation api that can check that all the applied changes are valid, and, only after that is verified, allow for the transaction to really commit (and if validation fails, that it should never commit, it should rollback).

In other words, it should be possible to validate stuff "before commit" not "before flushing", and it should be possible to flush 1 or more times before commit without having the validation triggered, since we might need to perform operations with the data flushed in to the database, and only if those operation give a valid result, commit...

The problem... I think, is that Hibernate event system doesn't cover "OnBeforeCommit" and even worse... it lacks a mechanism to inform this OnBeforeCommit of which objects were inserted, updated or deleted. (In fact, Hibernate knows that internally, but it doesn't expose an API to retrieve that information... and therefore transactions are blind to the changes that were flushed before the commit (and that makes it really hard to just call the validation algorithms of those entities that modified inside that particular transaction)

Monday, July 16, 2007

NoResultException is a really stupid idea!

When I saw Query.getSingleResult() I thought, "yes, great idea, I always have to add an utility method like that..."
But then, I met NoResultException... what a great way to screw a great idea!
Why not just return null??!!! getSingleResult should return 1 element, or null if it can not find stuff!

Saturday, July 14, 2007

Unit testing Relational Queries (SQL or HQL or JQL or LINQ...)

Hi!
Most applications I have built have something to do with a database (I remember that while I was on college I used to think that was not exiting stuff, I used to dream about doing neural networks stuff, logic programming in Prolog, etc) but then I met WebObjects and its Object Relational Mapper (Enterprise Objects Framework) and I got really excited about object oriented data programming... but I always had a problem... to test my object oriented queries I had to "manually" translate them into SQL and test them against the database, and only after they give me what I thought were correct result I would write them using EOF Qualifiers...
Then I met Hibernate's HQL and I realized it is much more powerful than EOF Qualifiers, but I still had to translate it to SQL to test it, I know I can get the SQL that is generated from the HQL from the debug console, and paste it in my favorite SQL editor... but even then if I found a mistake, a lot of times it was easier to tweak it in SQL and the manually translate it HQL.
Currently, there are some extensions for Eclipse (Hibernate Tools) that make this more "direct" but, what if I don't like (or don't want, or can't) use Eclipse... it would be great if someone could, for example, make a plugin for SquirrelSQL, but until then... what options do I have?

Then I learned about unit testing... and the answer came to my mind immediately: I just had to write a unit test for each of my queries. That worked fine... in the beginning... until I started having queries that returned thousands (or millions) of objects, and it wasn't such a good idea to output them to the debug console... and I had another problem... how should I write the "asserts" of query?... and how can I do it so that it doesn't make my test so slow that it becomes unusable? (I can, of course, check the results just by viewing them, but my brain is not that good to say if those 10,000 row really match with the idea I had when I wrote that HQL)

So, I started to look "what do I do" to check if an SQL query is correct, lets say for example, that I write this:

select count(*) from Address,Employee where Address.Id= Employee.AddressId and Employee.Id = 3

(Translated to English: How many address the Employee with Id = 3 has?)

Now... how do I test that? well I could add an assert after getting the result in java (or c#) like this:

assert(count>0)

But the, what happens if someone deletes the row with Id = 3 from the table Employee? That means my test will fail... or what what if someone deletes all the addresses from employee? and what if I want to test that if there are no addresses for an employee, the answer should be zero...

That is a lot of work just to test if that simple query is right... and I think that work could be done automatically:

Take a look at the SQL, it could be decomposed into:


select count(*) from (select * from Address,Employee where Address.Id= Employee.AddressId and Employee.Id = 3) as EmployeeAddresses

And then we could say, lets automatically check for the case when the resulting set is empty, and for the case when the result is not empty, to check for a case when the result is empty, we need and Employee with Addresses, so we generate:

Select * from Employee where exists(select * from Address where Address.EmployeeId = Employee.Id)

And take the Id first employee we get... and that should give us a non empty set if used in the original sql sentence that we are trying to test... after that, we automatically generate:

Select * from Employee where not exists(select * from Address where Address.EmployeeId = Employee.Id)

And take the Id first employee we get... and that should give us an empty set if used in the original sql sentence that we are trying to test...

I call this queries "inverses" of the original one, it like when one is testing a multiplication, to see if 2 x 3 = 6, just do: 6/3 = 2 and 6/2 = 3, if 2, and 3 match the operands of the multiplication, you multiplication is right. The same thing goes for SQL, one just has to find the way to "invert" it, if I could automate this inversion, the automatically generated queries would help me by telling me things that might not be immediately obvious to me when I look at the original query, and that would help me check if my original query is right.... it would me some kind of "invariants" that would help me to better understand my querying... or maybe I could even write the invariants first, and then create a query and see if it matches my invariants...

Mmmm.... maybe using a select there is another way to "invert" a query to test if it is right, using the actual inverse operation of selecting... that is "inserting", could I derive from:

select count(*) from Address,Employee where Address.Id= Employee.AddressId and Employee.Id = 3

Something like (In pseudocode):

Insert Employee;

Store Employee.Id

Run select count(*) from Address,Employee where Address.Id= Employee.AddressId and Employee.Id = EmployeeId

Assert("The answer should be zero")

Insert Address related to Employee

Run select count(*) from Address,Employee where Address.Id= Employee.AddressId and Employee.Id = EmployeeId

Assert("The answer should be one")

This has the advantage that I don't need a database with data already on it, but it has the disadvantage that takes lot of time to write an unit test like this in java, because to insert an employee, it might be necessary to:

Avoid breaking validation rules no related to this particular test, for example, an Employee must be related to a Department, but if the Department table is empty, then I should create a Department or I will not be able to insert an employee.
Avoid conflicts with validation rules directly related to this particular test, for example, what if I have an Hibernate interceptor that won't let me insert an address without 1 or more Addresses

The main problem here, I believe, is that f I insert a row leaving a not null column empty most databases won't wait until I try to commit the transaction to say "integrity violation" and rollback my changes... therefore it is impossible to write partial data just for the test that I have in front of me, but... could I automatically generate consistent inserts using as a source just the integrity rules at the database level... and the select that I want to test?
I think it can be done... the question is..
What is the algorithm to generate the inserts needed to satisfy an SQL select statement?

Thursday, July 05, 2007

The perfect infrastructure (framework?) for data systems

The perfect infrastructure (framework?) for data systems:

Has an object relational query language (something like JQL or LINQ)
Has a database with versioning (like Subversion) so you can always consult the database as it was in a particular moment in time transparently
Supports transactions... and distributed transactions. (like Spring)
Has a framework to exchange graphs of objects with a remote client, objects can be manipulated freely on the client, filtered and queried without hitting the database without need, and are transparently loaded in to de client without having the n+1 problem. (like an hybrid between Hibernate, Apple's EOF & Carrierwave)
Supports "client only transactions" and nested client only transactions (like the Editing Context in WebObject's JavaClient applications) so that it is possible to make rollbacks without hitting the database, and it is even possible to make partial rollbacks... and have savepoint functionality, without going all the way to the database (unless you want to do so)
Client objects, server objects and database elements are kept in perfect sync automatically, but it is possible to add logic to a particular tier of the system without too much hassle.
Has a validation framework, that make it really easy to write efficient validation code following DRY, and that validates data on the client, on the application server and on the database.
Validation code, combined with the versioning capabilities of the infrastructure allows to save information partially, as easily as writing part of a paper, validating only as you completely the information, with multiple integrity levels
It is possible to disconnect the client from the server, and it will be able to save your changes until the connection is established again
The applications built with this perfect infrastructure auto update automatically.
With a very simple configuration tweak, it is possible to download the application "sliced" in pages, or as a complete bundle. This capability is so well integrated that the final user can choose the installation method, and the programmer doesn't even care about this feature.
The developer only needs to specify the requirements semi-formally in a language (like Amalgam) and he will receive a running application, that adapts dynamically to specification (unless he chooses to "freeze" a particular feature of the application, in which case, the default procedural code for that feature is automatically generated, and the developer can customize as he wishes ... or decide to un-freeze it.
Can be coded in any language compatible with a virtual machine that runs anywhere, or can be compiled to an specific platform.
Allows for easy report design... by the developer, or the user.
It is opensource (or sharedsource), so that in the extremely unlikely case of needing another feature, or finding a bug, it can be easily fixed by the developer
It is freely (as in beer) downloadable from the Internet. (or has a reasonable price)
It is fully documented, with lots of examples, going from very simple examples for beginners, to really complex real world applications with best practices for experts
Includes the source code with unit-test with 100% coverage of the code
Supports design by contract coding (from the database up to the client side).

You know what is the funny (or sad) part of all this? I have met frameworks that do 1, 2 or even 3 or more of this features... but none that does them all... will I ever see such a thing? is even possible to build it?