How the SQLCAT Customer Lab is Monitoring SQL on Linux

Reviewed By: Denzil Ribeiro, Dimitri Furman, Mike Weiner, Rajesh Setlem, Murshed Zaman

Background

SQLCAT often works with early adopter customers, bringing them into our lab and running their workloads. With SQL Server now available on Linux, we needed a way to visualize performance, and PerfMon, being a Windows-only tool, was no longer an option. After a lot of research on ways to monitor performance on Linux, we didn’t find a de facto standard. However, we did learn that in the open source community there are many ways of accomplishing a goal and that there is no one “right way”; rather, you choose the way that works best for you.

The following solutions were tested:

  • Graphing with Grafana and Graphite
  • Collection with collectd and Telegraf
  • Storage with Graphite/Whisper and InfluxDB

We landed on a solution which uses InfluxDB, collectd and Grafana. InfluxDB gave us the performance and flexibility we needed, collectd is a lightweight tool for collecting system performance information, and Grafana is a rich and interactive tool for visualizing the data.
In the sections below, we will provide you with all the steps necessary to set up this same solution in your environment quickly and easily. Details include step-by-step setup and configuration instructions, along with a pointer to the complete GitHub project.

Solution Diagram

Here is the high-level architecture of how this solution works. Collectd continuously runs in a container on your SQL Server on Linux environment and pushes metrics to InfluxDB. The data is then visualized via Grafana dashboards, which query InfluxDB on demand.

Setup

When we found a set of tools that let us easily visualize performance for troubleshooting purposes, we wanted to provide an easy, repeatable method for deployment using Docker. The directions below will walk you through setting this up using our Docker images. The complete mssql-monitoring GitHub project can be found here. Give it a try; we welcome feedback on your experience.

Prerequisites

  1. Access to docker.io and GitHub for pulling Docker images and accessing the GitHub repository.
  2. One or two Linux machines for running InfluxDB and Grafana, depending on the size of your deployment.
    • If using 2 machines, 1 machine will be used for hosting the InfluxDB container and the second machine will be used for hosting the Grafana container
    • If using 1 machine, it will be used for hosting both the InfluxDB and Grafana containers.
  3. InfluxDB opened ports: 25826 (default inbound data to InfluxDB), 8086 (default outbound queries from Grafana)
  4. Grafana opened port: 3000 (default web port for inbound connections)
  5. A SQL Server on Linux machine or VM that you would like to monitor.
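If a host firewall is running on these machines, the ports listed above need to be opened. The commands below are a sketch, assuming firewalld on RHEL and ufw on Ubuntu; adjust to whatever firewall your environment actually uses. (Note: collectd's network plugin sends its metrics over UDP by default, which is why 25826 is opened as UDP here.)

```shell
# RHEL (firewalld): open InfluxDB and Grafana ports
sudo firewall-cmd --permanent --add-port=25826/udp   # collectd -> InfluxDB
sudo firewall-cmd --permanent --add-port=8086/tcp    # Grafana -> InfluxDB queries
sudo firewall-cmd --permanent --add-port=3000/tcp    # browser -> Grafana
sudo firewall-cmd --reload

# Ubuntu (ufw): same ports
sudo ufw allow 25826/udp
sudo ufw allow 8086/tcp
sudo ufw allow 3000/tcp
```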

Setting up InfluxDB

Note: If running on RHEL, you will need to add exclusions for SELinux or disable it. For more information, please see Working with SELinux.

For sizing InfluxDB, you can refer to the InfluxDB documentation. Note that provisioning SSD volumes for the InfluxDB data and wal directories is recommended, although in our experience this has not been necessary when monitoring just a few machines.

  1. Install Docker Engine (if not already installed)

    • For RHEL:

       yum install docker -y
      
    • For Ubuntu:

       wget -qO- https://get.docker.com/ | sudo sh
      
  2. Install Git for your distro (if not already installed)

    • For RHEL:

       yum install git -y
      
    • For Ubuntu:

       apt-get install git -y
      
  3. Clone the mssql-monitoring GitHub repository

     git clone https://github.com/Microsoft/mssql-monitoring.git
    
  4. Browse to mssql-monitoring/influxdb

     cd mssql-monitoring/influxdb
    
  5. Edit run.sh and change the variables to match your environment

    
    # By default, this will run without modification, but if you want to change where the data directory gets mapped, you can do that here
    # Make sure this folder exists on the host.
    # This directory from the host gets passed through to the docker container.
    INFLUXDB_HOST_DIRECTORY="/mnt/influxdb"
    
    # This is where the mapped host directory gets mapped to in the docker container.
    INFLUXDB_GUEST_DIRECTORY="/host/influxdb"
    
  6. Execute run.sh. This will pull down the mssql-monitoring-InfluxDB image and create and run the container
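To confirm the container came up, you can check Docker and hit InfluxDB's ping endpoint from the host; a healthy instance answers /ping with HTTP 204 No Content. (The container name filter below is an assumption; check `docker ps` output for the actual name created by run.sh.)

```shell
# Confirm the InfluxDB container is running
sudo docker ps --filter "name=influxdb"

# A healthy InfluxDB answers its health-check endpoint with HTTP 204
curl -i http://localhost:8086/ping
```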

Setting up collectd on the Linux SQL Server you want to monitor

Note: These commands have to be run on the SQL Server on Linux VM/box that you want to monitor

  1. Using SSMS or SQLCMD, create a SQL Server login to be used by collectd.

    
    USE master; 
    GO
    CREATE LOGIN [collectd] WITH PASSWORD = N'mystrongpassword';
    GO
    GRANT VIEW SERVER STATE TO [collectd]; 
    GO
    GRANT VIEW ANY DEFINITION TO [collectd]; 
    GO
    
  2. Install Docker Engine (if not already installed)

    • For RHEL:

       yum install docker -y
      
    • For Ubuntu:

       wget -qO- https://get.docker.com/ | sudo sh
      
  3. Install Git for your distro (if not already installed)

    • For RHEL:

       yum install git -y
      
    • For Ubuntu:

       apt-get install git -y
      
  4. Clone the mssql-monitoring GitHub repository

     git clone https://github.com/Microsoft/mssql-monitoring.git
    
  5. Browse to mssql-monitoring/collectd

     cd mssql-monitoring/collectd
    
  6. Edit run.sh and change the variables to match your environment

    
    #The ip address of the InfluxDB server collecting collectd metrics
    INFLUX_DB_SERVER="localhost"
    
    #The port that your InfluxDB is listening on for collectd traffic
    INFLUX_DB_PORT="25826"
    
    #The host name of the server you are monitoring. This is the value that shows up under hosts on the Grafana dashboard
    SQL_HOSTNAME="MyHostName"
    
    #The username you created from step 1
    SQL_USERNAME="sqluser"
    
    #The password you created from step 1
    SQL_PASSWORD="strongsqlpassword"
    
  7. Execute run.sh. This will pull down the mssql-monitoring-collectd image, configure it to start on reboot, and create and run the container.
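It can be worth verifying the login from step 1 independently before troubleshooting the container, for example with sqlcmd if you have the SQL Server command-line tools installed. Connection and DBI plugin errors also show up in the container's logs. (The container name below is an assumption; check `docker ps` for the name run.sh actually assigns.)

```shell
# Verify the collectd login can connect and read the DMVs it needs
sqlcmd -S localhost -U collectd -P 'mystrongpassword' \
       -Q "SELECT COUNT(*) FROM sys.dm_os_wait_stats;"

# Check the collectd container logs for connection or DBI plugin errors
sudo docker logs mssql-monitoring-collectd
```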

Setting up Grafana

If you are doing a small-scale setup (monitoring a few machines), you should be fine running this on the same host as your InfluxDB container. We use the image created by Grafana Labs, with the addition of a run.sh file that you can use to create and run the container.

  1. Install Docker Engine (if not already installed)

    • For RHEL:

       yum install docker -y
      
    • For Ubuntu:

       wget -qO- https://get.docker.com/ | sudo sh
      
  2. Install Git for your distro (if not already installed)

    • For RHEL:

       yum install git -y
      
    • For Ubuntu:

       apt-get install git -y
      
  3. Clone the mssql-monitoring GitHub repository

     git clone https://github.com/Microsoft/mssql-monitoring.git
    
  4. Browse to mssql-monitoring/grafana

     cd mssql-monitoring/grafana
    
  5. Edit run.sh and change the variables to match your environment

    
    # We use the grafana image that Grafana Labs provides https://docs.grafana.org/installation/docker/
    # If you wish to modify the port that Grafana runs on, you can do that here.
    sudo docker run -d -p 3000:3000 --name grafana grafana/grafana
    
  6. Execute run.sh. This will pull down the Grafana image and create and run the container.
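Grafana exposes a simple health endpoint, so a quick sanity check that the container is serving (assuming the default port from run.sh) is:

```shell
# Returns a small JSON document reporting "database": "ok" when Grafana is healthy
curl -s http://localhost:3000/api/health
```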

Configuring the InfluxDB data source in Grafana

In order for Grafana to pull data from InfluxDB, we need to set up the data source in Grafana.

  1. Browse to your Grafana instance
    • http://[GRAFANA_IP_ADDRESS]:3000
    • Login with default user admin and password admin
  2. Click "Add data source"
    • Name: influxdb
    • Type: InfluxDB
    • Url: http://[INFLUXDB_IP_ADDRESS]:8086
    • Database: collectd_db
  3. Click "Save & Test"
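If you prefer to script this instead of clicking through the UI, Grafana's HTTP API can create the same data source. A sketch using the default admin credentials (substitute your own credentials and InfluxDB address):

```shell
curl -s -X POST http://admin:admin@localhost:3000/api/datasources \
     -H "Content-Type: application/json" \
     -d '{
           "name": "influxdb",
           "type": "influxdb",
           "url": "http://[INFLUXDB_IP_ADDRESS]:8086",
           "access": "proxy",
           "database": "collectd_db"
         }'
```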

Importing Grafana dashboards

We have a set of dashboards that we use and have made available to the community. These dashboards are included in the mssql-monitoring GitHub repository. Just download them and import them into Grafana. Once the dashboards are imported, you will see the metrics that collectd, running on your SQL Server, is pushing to InfluxDB.
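The import can also be scripted against Grafana's dashboard API. A sketch, assuming you have downloaded the dashboard JSON files into the current directory (the wrapping object with "overwrite" is what the /api/dashboards/db endpoint expects; if a file contains a non-null "id" field you may need to null it out first):

```shell
for f in *.json; do
  curl -s -X POST http://admin:admin@localhost:3000/api/dashboards/db \
       -H "Content-Type: application/json" \
       -d "{\"dashboard\": $(cat "$f"), \"overwrite\": true}"
done
```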

How the data gets loaded

In this solution, we leverage collectd and several plugins to get data from the system(s) we are monitoring. Specifically, on the SQL Server side, we leverage the collectd DBI plugin with the FreeTDS driver, and execute the following queries every 5 seconds against the sys.dm_os_performance_counters and sys.dm_os_wait_stats DMVs. You can view the complete collectd.conf file here. These specific counters and waits provided a good starting point for us, but you can experiment and change them as you see fit.
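For reference, the DBI plugin section of collectd.conf has roughly the shape shown below. This is an abbreviated, illustrative sketch only; the statement is elided and the Type and option values here are assumptions. See the linked collectd.conf in the repository for the authoritative version.

```
<Plugin dbi>
  <Query "perf_counters">
    # Full SELECT statement elided; see the repo's collectd.conf
    Statement "SELECT ... FROM sys.dm_os_performance_counters ..."
    <Result>
      Type "counter"
      InstancesFrom "counter_name" "instance_name"
      ValuesFrom "cntr_value"
    </Result>
  </Query>
  <Database "master">
    Driver "freetds"
    DriverOption "host" "localhost"
    DriverOption "username" "collectd"
    DriverOption "password" "mystrongpassword"
    Query "perf_counters"
  </Database>
</Plugin>
```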

sys.dm_os_performance_counters query

For this query, we needed to replace spaces with underscores in counter and instance names to make them friendly for InfluxDB. We also do not need to reference the counter type field (cntr_type), since the delta calculation for cumulative counters is done in Grafana with the non-negative derivative function. To find out more about counter types and their implementation, please see Querying Performance Counters in SQL Server by Jason Strate and Collecting performance counter values from a SQL Azure database by Dimitri Furman.
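To illustrate what the non-negative derivative does: cumulative counters like Batch Requests/sec only ever grow, so the per-second rate is the difference between two consecutive samples divided by the sample interval, clamped at zero (a negative delta appears when a counter resets, for example after a SQL Server restart). A minimal sketch of that calculation:

```shell
# Per-second rate from two cumulative counter samples, clamped at zero,
# mirroring Grafana's non-negative derivative over our 5-second interval.
rate() {
  local prev=$1 cur=$2 interval=$3
  local delta=$(( (cur - prev) / interval ))
  [ "$delta" -lt 0 ] && delta=0
  echo "$delta"
}

rate 1000 1500 5   # 100 batch requests/sec
rate 1500 200 5    # counter reset: negative delta clamped to 0
```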

 
SELECT Replace(Rtrim(counter_name), ' ', '_')  AS counter_name, 
       Replace(Rtrim(instance_name), ' ', '_') AS instance_name, 
       cntr_value 
FROM   sys.dm_os_performance_counters 
WHERE  ( counter_name IN ( 'SQL Compilations/sec',
                            'SQL Re-Compilations/sec', 
                            'User Connections',
                            'Batch Requests/sec',
                            'Logouts/sec', 
                            'Logins/sec', 
                            'Processes blocked', 
                            'Latch Waits/sec',
                            'Full Scans/sec', 
                            'Index Searches/sec',
                            'Page Splits/sec', 
                            'Page Lookups/sec', 
                            'Page Reads/sec', 
                            'Page Writes/sec', 
                            'Readahead Pages/sec', 
                            'Lazy Writes/sec', 
                            'Checkpoint Pages/sec', 
                            'Database Cache Memory (KB)', 
                            'Log Pool Memory (KB)', 
                            'Optimizer Memory (KB)', 
                            'SQL Cache Memory (KB)', 
                            'Connection Memory (KB)', 
                            'Lock Memory (KB)', 
                            'Memory broker clerk size', 
                            'Page life expectancy' ) ) 
OR ( instance_name IN ( '_Total', 
                        'Column store object pool' ) 
AND counter_name IN ( 'Transactions/sec', 
                        'Write Transactions/sec', 
                        'Log Flushes/sec', 
                        'Log Flush Wait Time', 
                        'Lock Timeouts/sec', 
                        'Number of Deadlocks/sec', 
                        'Lock Waits/sec', 
                        'Latch Waits/sec', 
                        'Memory broker clerk size', 
                        'Log Bytes Flushed/sec', 
                        'Bytes Sent to Replica/sec', 
                        'Log Send Queue', 
                        'Bytes Sent to Transport/sec', 
                        'Sends to Replica/sec', 
                        'Sends to Transport/sec', 
                        'Bytes Received from Replica/sec', 
                        'Receives from Replica/sec', 
                        'Flow Control Time (ms/sec)', 
                        'Flow Control/sec', 
                        'Resent Messages/sec', 
                        'Redone Bytes/sec') 
OR ( object_name = 'SQLServer:Database Replica' 
AND counter_name IN ( 'Log Bytes Received/sec', 
                        'Log Apply Pending Queue', 
                        'Redone Bytes/sec', 
                        'Recovery Queue', 
                        'Log Apply Ready Queue') 
AND instance_name = '_Total' ) )
OR ( object_name = 'SQLServer:Database Replica' 
AND counter_name IN ( 'Transaction Delay' ) )

sys.dm_os_wait_stats query

 
WITH waitcategorystats ( wait_category, 
                        wait_type, 
                        wait_time_ms, 
                        waiting_tasks_count, 
                        max_wait_time_ms) 
    AS (SELECT CASE 
                WHEN wait_type LIKE 'LCK%' THEN 'LOCKS' 
                WHEN wait_type LIKE 'PAGEIO%' THEN 'PAGE I/O LATCH' 
                WHEN wait_type LIKE 'PAGELATCH%' THEN 'PAGE LATCH (non-I/O)' 
                WHEN wait_type LIKE 'LATCH%' THEN 'LATCH (non-buffer)' 
                ELSE wait_type 
                END AS wait_category, 
                wait_type, 
                wait_time_ms, 
                waiting_tasks_count, 
                max_wait_time_ms 
    FROM   sys.dm_os_wait_stats 
    WHERE  wait_type NOT IN ( 'LAZYWRITER_SLEEP', 
            'CLR_AUTO_EVENT', 
            'CLR_MANUAL_EVENT',
            'REQUEST_FOR_DEADLOCK_SEARCH', 
            'BACKUPTHREAD', 
            'CHECKPOINT_QUEUE', 
            'EXECSYNC', 
            'FFT_RECOVERY', 
            'SNI_CRITICAL_SECTION', 
            'SOS_PHYS_PAGE_CACHE', 
            'CXROWSET_SYNC',
            'DAC_INIT', 
            'DIRTY_PAGE_POLL', 
            'PWAIT_ALL_COMPONENTS_INITIALIZED',
            'MSQL_XP', 
            'WAIT_FOR_RESULTS', 
            'DBMIRRORING_CMD', 
            'DBMIRROR_DBM_EVENT', 
            'DBMIRROR_EVENTS_QUEUE', 
            'DBMIRROR_WORKER_QUEUE', 
            'XE_TIMER_EVENT', 
            'XE_DISPATCHER_WAIT', 
            'WAITFOR_TASKSHUTDOWN', 
            'SQLTRACE_INCREMENTAL_FLUSH_SLEEP', 
            'WAITFOR', 
            'QDS_CLEANUP_STALE_QUERIES_TASK_MAIN_LOOP_SLEEP', 
            'QDS_PERSIST_TASK_MAIN_LOOP_SLEEP', 
            'HADR_FILESTREAM_IOMGR_IOCOMPLETION', 
            'LOGMGR_QUEUE', 
            'FSAGENT' ) 
    AND wait_type NOT LIKE 'PREEMPTIVE%' 
    AND wait_type NOT LIKE 'SQLTRACE%' 
    AND wait_type NOT LIKE 'SLEEP%' 
    AND wait_type NOT LIKE 'FT_%' 
    AND wait_type NOT LIKE 'XE%' 
    AND wait_type NOT LIKE 'BROKER%' 
    AND wait_type NOT LIKE 'DISPATCHER%' 
    AND wait_type NOT LIKE 'PWAIT%' 
    AND wait_type NOT LIKE 'SP_SERVER%') 
SELECT wait_category, 
       Sum(wait_time_ms)        AS wait_time_ms, 
       Sum(waiting_tasks_count) AS waiting_tasks_count, 
       Max(max_wait_time_ms)    AS max_wait_time_ms 
FROM   waitcategorystats 
WHERE  wait_time_ms > 100 
GROUP  BY wait_category

Dashboard Overview

With the metrics that we collect from the collectd system plugins and the DBI plugin, we are able to chart the following metrics over time and in near real time, with up to 5 seconds of data latency. The following is a snapshot of the metrics that we graph in Grafana.

Core Server Metrics

Core SQL Metrics

Comments

  • Anonymous
    July 05, 2017
    Great article :) Which variables do we have to change in "run.sh" for setting up InfluxDB? Can you explain more?
    • Anonymous
      July 05, 2017
      George, the run.sh parameters for InfluxDB are here: https://github.com/Microsoft/mssql-monitoring/blob/master/influxdb/run.sh Effectively, changing them in that file and running it launches the Docker image with those variables passed in, which are then configured in the InfluxDB conf file:
      INFLUXDB_HOST_DIRECTORY="/mnt/influxdb"
      # This is where the mapped host directory gets mapped to in the docker container.
      INFLUXDB_GUEST_DIRECTORY="/host/influxdb"
      # This is where the InfluxDB directories get mapped in the container.
      INFLUXDB_DATA_DIRECTORY="/host/influxdb/data/db"
      INFLUXDB_META_DATA_DIRECTORY="/host/influxdb/data/meta"
      INFLUXDB_WAL_DATA_DIRECTORY="/host/influxdb/wal/wal"
      INFLUXDB_HH_DATA_DIRECTORY="/host/influxdb/data/hh"
      # The port that InfluxDB is listening on for collectd metrics.
      INFLUXDB_COLLECTD_LISTEN_PORT="25826"
      • Anonymous
        July 05, 2017
        I see, Mr Denzil. You said that we have to change the variables in "run.sh" to match our environment, but I don't know what values I have to give them in order to do that.
        • Anonymous
          July 06, 2017
          There are variables in each of the run.sh files for each of the images that you have to modify to suit your environment. For example, the run.sh file for the collectd Docker image has variables such as the SQL username and password used to connect to SQL Server and run the DMV queries. On the collectd image, you have to give it the name of the InfluxDB server and the username and password used to connect to the SQL Server you are monitoring:
          INFLUX_DB_SERVER="localhost"
          #The username used to connect to SQL Server.
          SQL_USERNAME="sa"
          #The password of the account used to connect to SQL Server
          SQL_PASSWORD="sa_password"
  • Anonymous
    July 05, 2017
    it works like a charm, thank you
    • Anonymous
      July 05, 2017
      The dashboard "Core SQL Metrics" does not work; it displays no data points. Can you please explain how to fix it?
    • Anonymous
      July 06, 2017
      Great!
  • Anonymous
    July 05, 2017
    Any advantages or disadvantages of using telegraf vs. collectd? We have been using the SQL Server dashboard published by sqlzenmonitor on Grafana labs (https://grafana.com/dashboards/409/revisions) for sometime with great success.
    • Anonymous
      July 05, 2017
      Tommy, we had this done internally around the time the Telegraf SQL dashboard was released and just got around to pushing it out. We wanted just the most lightweight DMV queries. This can absolutely be done with Telegraf as well; it was just a choice of tool at the time.
    • Anonymous
      July 12, 2017
      Not only the dashboard, but the complete sqlserver plugin ;-) Glad to hear you like the plugin. It also works for SQL Server on Linux, except for OS Volume and CPU, which use sys.dm_os_volume_stats and sys.dm_os_ring_buffers. And to reassure Denzil, the queries I use are also lightweight.
  • Anonymous
    July 06, 2017
    How will the solution access the SQL Server database to run the DMV queries without user authentication?
    • Anonymous
      July 06, 2017
      Siraj, it is collectd that does the authentication behind the scenes. You just have to change the values of the variables listed below in the "/mssql-monitoring/collectd/run.sh" file in order for them to match your SA user credentials:
      #The host name of the server you are monitoring. This is the value that shows up under hosts on the Grafana dashboard
      SQL_HOSTNAME="MyHostName"
      #The username used to connect to SQL Server.
      SQL_USERNAME="sa"
      #The password of the account used to connect to SQL Server
      SQL_PASSWORD="sa_password"
      • Anonymous
        July 06, 2017
        We will update the documentation a bit. Any login with VIEW SERVER STATE / VIEW ANY DEFINITION permissions should work ( Does not have to be sa)
        • Anonymous
          July 17, 2017
          Can you explain to me why it does not have to be SA?
          • Anonymous
            July 17, 2017
            @Zakaria Jdi: This does not need to be an SA account. I have updated the documentation to reflect the changes under "Setting up collectd on the Linux SQL Server you want to monitor". As Denzil called out above, collection will work with any account that has the following permissions: VIEW SERVER STATE / VIEW ANY DEFINITION.
  • Anonymous
    July 06, 2017
    Please, I would like to know also: does this solution give both system (OS) and SQL Server performance counters?
    • Anonymous
      July 06, 2017
      Yes, it does give you both OS and SQL metrics, with different dashboards for each. See the screenshots in the blog post.
  • Anonymous
    November 09, 2018
    The comment has been removed
    • Anonymous
      November 16, 2018
      Hi @Hautj, this example was set up to monitor a single instance; we did not test running against multiple instances. Reading the DBI documentation, it looks like what you are trying to do is supported, and I am not sure why you are getting the errors you are seeing. Here is the documentation on the DBI plugin: https://collectd.org/documentation/manpages/collectd.conf.5.shtml#plugin_dbi One thing that we had done in the past to monitor multiple instances was to create a separate Docker container running collectd with only the DBI plugin, and this ended up working well for us.