T-SQL: How to Get the Latest Costs

This article is aimed at developers new to T-SQL and SQL server.  Most use will be gained by the developer who wants a pointer to alternative techniques or who has an immediate problem to solve. 

It is intended as something of a taster - a starting point for further research.  If the reader is applying a quick fix it is strongly recommended they investigate further once the dust has settled on their problem.  Along the way some general tips are provided and the new developer should notice the layout of scripts.  There is no auto format SQL feature in Microsoft SQL Server Management Studio.

If the reader is already an experienced SQL developer then there will probably be little new herein.

The logic and database is presented only to illustrate issues and techniques.  Real world logic will be sacrificed in an attempt to make the points clearer.

SalesSystem Database Diagram

SubQuery

We start out with the task of getting the latest TripCost which should apply for the date an order will be Delivered on.
Notice that there is no particular relationship between HaulierCosts and Orders.

We can do this using a SubQuery - as it's name suggests this is a select query within our select query.

A subquery can be used to obtain a single value from a set of data which cannot be obtained purely by a join and simple where clause.  A classic example of this is where you want a value out the most recent qualifying record from many records which would qualify on a date range.

HaulierCosts apply for a period between StartDate and  EndDate.  The people signing contracts for these things are a bit sloppy, if there are two qualifying we want the most recent StartDate as this will be the latest contract.  For this example, HaulierUsed specifies which Haulier will be used.

We will use the TripCost from HaulierCosts per Order.

use SalesSystem
 
Select 
    o.Id as OrderId 
   ,(select top 1 
            TripCost from HaulierCosts h 
            inner join HaulierUsed hu on h.HaulierId = hu.Id 
            where o.DeliveryDate  between h.StartDate and h.EndDate
              and o.DeliveryDate  between hu.DateUsedFrom and hu.DateUsedTo
            order by h.StartDate desc
             ) as tripCost
 
from Orders o

This query might look a bit confusing at first to the reader unfamiliar with subqueries.  The main outer query is returning everything from the Order table and aliases it as o .

The inner select within the brackets is the subquery and that joins Haulier to HaulierUsed to obtain the one haulier used for the period.  There will be several sets of costs which qualify so the subquery is sorted in descending StartDate and the Top 1 returns the most recent cost.

A common mistake with subqueries is to forget that only one row with one value can be returned.  If you don’t use Top 1 or some aggregate function then an error will be produced.

Correlated Subquery

Let’s change the requirement slightly.  Say the calculation must take into account the weight of the product carried as well as the flat TripCost.  For each Order we now also need to total each OrderLine Quantity * Product Weight.

This particular calculation is (deliberately) complicated slightly because we will be using a join to HaulierCosts and there is nothing to join on. Another complication is that there is going to be one cost per order and another cost for each order line.

Similar sort of rules apply to writing SQL as code.  With a complicated piece of code you would break it down into pieces and tackle each in turn proving you could do something simple then iterating to add a bit more functionality, repeating the process with only some known small change that could go wrong.
As this is that bit more complicated it's best to approach this in steps.

The first aspect to consider is getting the two values we need.  A SubQuery in the way we used it in the first script is no good since we need two values.  We need a slightly different approach.

The technique is called a Correlated Subquery and how this works is we join to the table we want the data from and use a subquery to return a key to use in our where clause defining which one out the many joined that we want to pick.

Since there’s no relationship between Orders and HaulierCosts we use a Cross Join. 

use SalesSystem
 
Select 
     o.Id as OrderId 
   , hc.TripCost as TripCost
   , hc.KgCost as KgCost
from Orders o
  cross join HaulierCosts hc
where  hc.Id =
  (select top 1 
            h.Id 
            from HaulierCosts h 
            inner join HaulierUsed hu on h.HaulierId = hu.Id 
            where o.DeliveryDate  between h.StartDate and h.EndDate
              and o.DeliveryDate  between hu.DateUsedFrom and hu.DateUsedTo
            order by h.StartDate desc
             )

There’s a similar sort of subquery to the one we had last time, but in the where clause. Give it a go and it returns all records as expected. 
We now have the OrderId and the costs.

Let’s add OrderLines into the mix.

use SalesSystem
 
Select 
     o.Id as OrderId 
   , hc.TripCost as TripCost
   , hc.KgCost as KgCost
from Orders o
  inner join OrderLines ol on o.Id = ol.OrderId
  cross join HaulierCosts hc
where  hc.Id =
  (select top 1 
            h.Id 
            from HaulierCosts h 
            inner join HaulierUsed hu on h.HaulierId = hu.Id 
            where o.DeliveryDate  between h.StartDate and h.EndDate
              and o.DeliveryDate  between hu.DateUsedFrom and hu.DateUsedTo
            order by h.StartDate desc
             )

All very well, but when we run that there is a row per OrderLine and we need a cost per Order.

Let’s join to products as well and group by order so we can have the OrderId and TripCost.  The calculation per product line also needs to be totalled

use SalesSystem
 
Select 
     o.Id as OrderId 
   , max(hc.TripCost) as TripCost
   , sum(hc.KgCost * ol.Quantity * p.Weight) as KgCost
from Orders o
  inner join OrderLines ol on o.Id = ol.OrderId
  inner join Products p on ol.ProductId = p.Id
  cross join HaulierCosts hc
where  hc.Id =
  (select top 1 
            h.Id 
            from HaulierCosts h 
            inner join HaulierUsed hu on h.HaulierId = hu.Id 
            where o.DeliveryDate  between h.StartDate and h.EndDate
              and o.DeliveryDate  between hu.DateUsedFrom and hu.DateUsedTo
            order by h.StartDate desc
             )
group by o.Id

Since we're grouping by o.Id we have to use aggregate functions on the other variables getting just the one value for TripCost and the sum for KgCost.
Note
If we also grouped by hc.TripCost we could drop that max aggregate function.
The group by also highlights a subtlety of SQL in that we are aliasing that as OrderId but that sort of only happens after the SQL runs and where we're using that group by we need to use o.Id.  That can catch out the occasional SQL user until they get used to how aliasing works.

Denormalisation

Flushed with confidence in our new found SQL skills, we take the query developed against the development database and run it against the live one.

Uh Oh.

We have 2 million records in this live database and it takes way too long to run.  The costs are supposed to appear online when the CEO wants to see on his laptop how things are going over a coffee.

Maybe we buy him a couple more coffees?
OK, Maybe not.

If we need that data to appear quickly then we need a plan B.

In many businesses, the numbers as they were yesterday are pretty much the same as today.   If this is full SQL server then we can use a batch job to run a stored procedure ( or perhaps SSIS ) and store the result in a table.

Technically speaking this is redundant data – we already have the base numbers in the database.  It’s just that for practical reasons we need the calculated result.  This is called Denormalisation.

So we add the OrderCosts table.

This has a primary Key of OrderId – but this is not one of those identity autonumber ones, just a primary key.

CREATE TABLE [dbo].[OrderCosts](
    [OrderId] [int] NOT NULL,
    [TripCost] [money] NULL,
    [KgCost] [money] NULL,
 CONSTRAINT [PK_OrderCosts] PRIMARY KEY CLUSTERED 
(
    [OrderId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

Next up is a script to wipe the table and replace it with out calculated data.

We use Truncate because it's quicker than delete - it does not log each record as deleted.  One potential issue elsewhere to be aware of with Truncate is that foreign key constraints will complicate things.  Look that up if you ever want to Truncate a "proper" table.

Once this script has been run we can obtain our costs from OrderCosts  by joining from Orders.

As we’re denormalising, let’s also total the costs into the Order.  You would probably do one or the other in a real world application.

use SalesSystem
 
Truncate table OrderCosts
 
insert into  OrderCosts
select
     o.Id as OrderId 
   , max(hc.TripCost) as TripCost
   , sum(hc.KgCost * ol.Quantity * p.Weight) as KgCost
from Orders o
  inner join OrderLines ol on o.Id = ol.OrderId
  inner join Products p on ol.ProductId = p.Id
  cross join HaulierCosts hc
where  hc.Id =
  (select top 1 
            h.Id 
            from HaulierCosts h 
            inner join HaulierUsed hu on h.HaulierId = hu.Id 
            where o.DeliveryDate  between h.StartDate and h.EndDate
              and o.DeliveryDate  between hu.DateUsedFrom and hu.DateUsedTo
            order by h.StartDate desc
             )
group by o.Id
 
update Orders 
 set HaulageCost = oc.KgCost + oc.TripCost
 from Orders o
inner join OrderCosts oc on o.Id = oc.OrderId
 
 
select 
   TripCost
  ,KgCost
  ,DeliveryDate 
from OrderCosts oc inner join orders o on oc.OrderId = o.Id

SSIS

This is mentioned here because the developer who has SSIS available can use aggregation in the pipeline to do calculations which are then passed to multiple tasks.

SSIS might well be a better candidate than the next two if the developer has time to learn yet another skill set.

Temporary Table

Let’s imagine the requirement for data manipulation is more complicated again. Maybe the capacity of each vehicle comes into it, the miles the lorry travels between warehouse and customer.

One way to handle very complicated requirements is to break them down into smaller simpler queries, insert with the first query and then each one after updates for the column or columns it is calculating.

Maybe the requirement is for overnight reporting rather than online viewing so our manager rejects Denormalisation and tells us to do it on the fly.

In this case we can use a Temporary table.

These are prefixed by # if the scope is the current session and ## for all sessions.

Let’s assume our requirement can somehow be handled in the one script.

We first need to create our table.

A tip here if you’re not used to TSQL is to do this in the SQL Server management studio designer as a regular table, refresh the tables to see it in the object explorer then right click the table.  From the context menu choose>Script Table as > Create To > New editor window
Which gives you a script which is mostly not so bad to follow until you get to the key part - the CONSTRAINT part and that stuff on the end.  
You don't really need to learn all about that for now.
You can pretty much just substitute # for the dbo. In front of the table.

Our script must create the table and insert into it.
In this case it's pretty easy and we can just prefix the select with an insert into

CREATE TABLE #OrderCosts(
    [OrderId] [int] NOT NULL,
    [TripCost] [money] NULL,
    [KgCost] [money] NULL,
 CONSTRAINT [PK_OrderCosts] PRIMARY KEY CLUSTERED 
(
    [OrderId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
 
insert into  #OrderCosts
select
     o.Id as OrderId 
   , max(hc.TripCost) as TripCost
   , sum(hc.KgCost * ol.Quantity * p.Weight) as KgCost
from Orders o
  inner join OrderLines ol on o.Id = ol.OrderId
  inner join Products p on ol.ProductId = p.Id
  cross join HaulierCosts hc
where  hc.Id =
  (select top 1 
            h.Id 
            from HaulierCosts h 
            inner join HaulierUsed hu on h.HaulierId = hu.Id 
            where o.DeliveryDate  between h.StartDate and h.EndDate
              and o.DeliveryDate  between hu.DateUsedFrom and hu.DateUsedTo
            order by h.StartDate desc
             )
group by o.Id
 
 
select 
   TripCost
  ,KgCost
  ,DeliveryDate 
from #OrderCosts oc inner join orders o on oc.OrderId = o.Id

We can then use that temporary table in selects following it, so long as they are still within the same script.

Table Variables

The journeyman developer might have come across the Table variable which some claim are faster than temp tables.   The reader is supposed to be pretty new to this T-SQL subject and therefore just wants a simple answer.

The simple approach is to only use Temporary tables.

 If you are working on a function then you cannot use a temporary table and will have to use a table variable.  Almost by definition though, the developer working on a function is unlikely to be amongst the intended audience for this article.

Notes

The sample code and database files can be downloaded here

Further reading:

Stackoverflow: When should I use a table variable

This one goes into great detail, an awful lot of detail.
Stackexchange: Whats the difference between a temp table and a table variable