Monitoring TokuMX

This article will help you get the TokuMX plugin for sd-agent configured and returning metrics.

Installing the tokumx plugin package

Install the tokumx plugin on Debian/Ubuntu:

sudo apt-get install sd-agent-tokumx

Install the tokumx plugin on RHEL/CentOS:

sudo yum install sd-agent-tokumx

Read more about agent plugins.

Configuring the agent to monitor TokuMX

1. Connect to the mongo shell and create a read-only monitoring user (the third argument to db.addUser, true, marks the user as read-only). If your server requires authentication, authenticate before doing so:

mongo
use admin
db.addUser("serverdensity", "supersecurepassword", true)

2. Configure /etc/sd-agent/conf.d/tokumx.yaml

init_config:

instances:
  # Specify the MongoDB URI, with database to use for reporting (defaults to "admin")
  # E.g. mongodb://localhost:27017/my-db
  - server: mongodb://localhost:27017
    username: serverdensity
    password: supersecurepassword
  • If the database server runs on a non-standard port, if you need to declare the path to the socket, or if you want to adjust how the check is executed, amend the rest of the config file as necessary.
  • NOTE: It is possible to use a full connection string in the server option, such as mongodb://serverdensity:supersecurepassword@localhost:27016/my-db. However, this is NOT recommended: the server string is automatically used as a tag, which will expose your username and password in the Server Density UI.
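The warning about credentials in the server option can be checked mechanically before you restart the agent. The following is a minimal sketch in Python using only the standard library. The function name server_uri_has_credentials is hypothetical and is not part of sd-agent; it only inspects the URI string and does not connect to the database.

```python
from urllib.parse import urlsplit


def server_uri_has_credentials(uri: str) -> bool:
    """Return True if a MongoDB-style URI embeds a username or password."""
    parts = urlsplit(uri)
    return parts.username is not None or parts.password is not None


if __name__ == "__main__":
    # Example values taken from this article; only the host is printed,
    # so the password is never echoed back.
    for uri in (
        "mongodb://localhost:27017",
        "mongodb://serverdensity:supersecurepassword@localhost:27016/my-db",
    ):
        verdict = "embeds credentials" if server_uri_has_credentials(uri) else "ok"
        print(f"{urlsplit(uri).hostname}: {verdict}")
```

If the function returns True for the value in your server option, move the user name and password into the separate username and password options instead.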

3. Restart the agent

sudo /etc/init.d/sd-agent restart

or

sudo systemctl restart sd-agent

Connecting over SSL

The agent can connect to your TokuMX instances over SSL, but this may require some changes to the agent configuration if you are using certificate files.

  • If you are not using certificate files

You can just specify ?ssl=true in the connection string. For example, in step 2 of the configuration instructions above, set the server option to mongodb://hostname:port/?ssl=true.

  • If you are using certificate files

If you are using certificate files, you'll need to tell the agent where they are. Set the following lines in your {agentdir}/conf.d/tokumx.yaml file:

ssl: True # Optional (defaults to False)
ssl_keyfile: /path/to/key.file
ssl_certfile: /path/to/cert.file
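Before restarting the agent, it can help to confirm that the two files are readable and belong together. The following is a minimal sketch using Python's standard ssl module. The function name cert_pair_loads is hypothetical and not part of sd-agent, and the sketch assumes PEM files with an unencrypted private key (an encrypted key would prompt for a passphrase).

```python
import ssl


def cert_pair_loads(certfile: str, keyfile: str) -> bool:
    """Return True if the certificate and private key load as a matching pair."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    try:
        context.load_cert_chain(certfile=certfile, keyfile=keyfile)
    except OSError:
        # Covers missing or unreadable files (FileNotFoundError, PermissionError)
        # and ssl.SSLError, which is a subclass of OSError and is raised for
        # malformed files or a key that does not match the certificate.
        return False
    return True


if __name__ == "__main__":
    # Placeholder paths from the config snippet above; replace with your own.
    print(cert_pair_loads("/path/to/cert.file", "/path/to/key.file"))
```

Running the sketch as the sd-agent user also exercises file permissions, since a file that user cannot read makes the function return False.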

Verifying the configuration

Run the agent's info command to verify the configuration:

sudo /etc/init.d/sd-agent info 

or

/usr/share/python/sd-agent/agent.py info

If the agent has been configured correctly, you'll see output such as:

tokumx
-----
  - instance #0 [OK]
  - Collected * metrics
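If you want to check this output from a script rather than by eye, the sample above can be parsed with a few lines of Python. This is a minimal sketch that assumes the info output always follows the layout shown in the sample; the function name instance_statuses is hypothetical, and any status other than OK is an assumption about what the agent might print.

```python
import re
from typing import List

# The sample output shown in this article.
SAMPLE = """tokumx
-----
  - instance #0 [OK]
  - Collected * metrics
"""


def instance_statuses(info_output: str, check_name: str = "tokumx") -> List[str]:
    """Return the status of each instance listed under the given check."""
    statuses = []
    in_section = False
    for line in info_output.splitlines():
        stripped = line.strip()
        if stripped == check_name:
            in_section = True
            continue
        if in_section:
            match = re.match(r"- instance #\d+ \[(\w+)\]", stripped)
            if match:
                statuses.append(match.group(1))
            elif stripped and not stripped.startswith("-"):
                # The underline and list items all start with "-", so any
                # other non-blank line is taken as the start of another section.
                in_section = False
    return statuses


if __name__ == "__main__":
    print(instance_statuses(SAMPLE))
```

To use it on a live system, capture the output of the info command to a string and pass it to the function in place of SAMPLE.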

You can also view the metrics returned with the following command:

sudo -u sd-agent /usr/share/python/sd-agent/agent.py check tokumx

Configuring graphs

Click the name of your server in the Devices list in your Server Density account, then go to the Metrics tab. Click the + Graph button on the right, then choose the tokumx metrics to display the graphs. The metrics will also be available to select when building dashboard graphs.

Monitored metrics

Metric Values
tokumx.asserts.msgps

The number of message assertions raised per second.
assertion / second
Type: float
tokumx.asserts.regularps

The number of regular assertions raised per second.
assertion / second
Type: float
tokumx.asserts.rolloversps

The number of times per second that the rollover counters roll over. The counters roll over to zero every 2^30 assertions.
assertion / second
Type: float
tokumx.asserts.userps

The number of user assertions raised per second.
assertion / second
Type: float
tokumx.asserts.warningps

The number of warnings raised per second.
assertion / second
Type: float
tokumx.connections.available

The number of unused available incoming connections the database can provide.
connection / None
Type: float
tokumx.connections.current

The number of connections to the database server from clients.
connection / None
Type: float
tokumx.cursors.timedOut

The total number of cursors that have timed out since the server process started.
cursor / None
Type: float
tokumx.cursors.totalOpen

The number of cursors that TokuMX is maintaining for clients.
cursor / None
Type: float
tokumx.ft.alerts.checkpointFailures

The number of checkpoints that have failed for any reason.
event / None
Type: float
tokumx.ft.alerts.locktreeRequestsPending

The number of requests for Document-level Locks in the locktree that are waiting for other requests to release their locks.
request / None
Type: float
tokumx.ft.alerts.longWaitEvents.cachePressure.countps

Rate at which a thread had to wait more than 1 second for evictions to create space in the cachetable for it to page in data it needed.
event / second
Type: float
tokumx.ft.alerts.longWaitEvents.cachePressure.timeps

Fraction of time (microseconds/second) that a thread had to wait more than 1 second for evictions to create space in the cachetable for it to page in data it needed.
fraction / None
Type: float
tokumx.ft.alerts.longWaitEvents.checkpointBegin.countps

Rate at which the begin checkpoint phase of checkpoint has run (these should be fairly quick).
event / second
Type: float
tokumx.ft.alerts.longWaitEvents.checkpointBegin.timeps

Fraction of time (microseconds/second) that a begin checkpoint phase has spent blocking other threads.
fraction / None
Type: float
tokumx.ft.alerts.longWaitEvents.fsync.countps

Rate at which fsync operations took more than 1 second.
event / second
Type: float
tokumx.ft.alerts.longWaitEvents.fsync.timeps

Fraction of time (microseconds/second) spent performing fsync operations that took longer than 1 second.
fraction / None
Type: float
tokumx.ft.alerts.longWaitEvents.locktreeWait.countps

Rate at which a thread had to wait more than 1 second to acquire a document-level lock in the locktree.
event / second
Type: float
tokumx.ft.alerts.longWaitEvents.locktreeWait.timeps

Fraction of time (microseconds/second) spent by threads waiting more than 1 second to acquire a document-level lock in the locktree.
fraction / None
Type: float
tokumx.ft.alerts.longWaitEvents.locktreeWaitEscalation.countps

Rate at which a thread had to wait more than 1 second to acquire a document-level lock because the locktree was at the memory limit and needed to run escalation.
event / second
Type: float
tokumx.ft.alerts.longWaitEvents.locktreeWaitEscalation.timeps

Fraction of time (microseconds/second) spent by threads waiting more than 1 second to acquire a document-level lock because the locktree was at the memory limit and needed to run escalation.
fraction / None
Type: float
tokumx.ft.alerts.longWaitEvents.logBufferWaitps

Rate at which a writing client had to wait more than 100ms for access to the log buffer.
event / second
Type: float
tokumx.ft.cachetable.evictions.full.leaf.clean.bytesps

Rate of full evictions of leaf nodes.
byte / second
Type: float
tokumx.ft.cachetable.evictions.full.leaf.clean.countps

Rate of full evictions of leaf nodes.
event / second
Type: float
tokumx.ft.cachetable.evictions.full.leaf.dirty.bytesps

Rate of full evictions of leaf nodes that need to be written back to disk.
byte / second
Type: float
tokumx.ft.cachetable.evictions.full.leaf.dirty.countps

Rate of full evictions of leaf nodes that need to be written back to disk.
event / second
Type: float
tokumx.ft.cachetable.evictions.full.leaf.dirty.timeps

Fraction of time (microseconds/second) spent performing full evictions of leaf nodes, including the time spent serializing, compressing, and writing those nodes to disk.
fraction / None
Type: float
tokumx.ft.cachetable.evictions.full.nonleaf.clean.bytesps

Rate of full evictions of nonleaf nodes.
byte / second
Type: float
tokumx.ft.cachetable.evictions.full.nonleaf.clean.countps

Rate of full evictions of nonleaf nodes.
event / second
Type: float
tokumx.ft.cachetable.evictions.full.nonleaf.dirty.bytesps

Rate of full evictions of nonleaf nodes that need to be written back to disk.
byte / second
Type: float
tokumx.ft.cachetable.evictions.full.nonleaf.dirty.countps

Rate of full evictions of nonleaf nodes that need to be written back to disk.
event / second
Type: float
tokumx.ft.cachetable.evictions.full.nonleaf.dirty.timeps

Fraction of time (microseconds/second) spent performing full evictions of nonleaf nodes, including the time spent serializing, compressing, and writing those nodes to disk.
fraction / None
Type: float
tokumx.ft.cachetable.evictions.partial.leaf.clean.bytesps

Rate of partial evictions of leaf nodes.
byte / second
Type: float
tokumx.ft.cachetable.evictions.partial.leaf.clean.countps

Rate of partial evictions of leaf nodes.
event / second
Type: float
tokumx.ft.cachetable.evictions.partial.nonleaf.clean.bytesps

Rate of partial evictions of nonleaf nodes.
byte / second
Type: float
tokumx.ft.cachetable.evictions.partial.nonleaf.clean.countps

Rate of partial evictions of nonleaf nodes.
event / second
Type: float
tokumx.ft.cachetable.miss.countps

Rate of internal cache misses. This metric is similar to MongoDB’s btree misses and page faults.
miss / second
Type: float
tokumx.ft.cachetable.miss.full.countps

Rate of full internal cache misses.
miss / second
Type: float
tokumx.ft.cachetable.miss.full.timeps

Fraction of time (microseconds/second) the database has had to wait for a disk read to complete for a full cache miss.
fraction / None
Type: float
tokumx.ft.cachetable.miss.partial.countps

Rate of partial internal cache misses.
miss / second
Type: float
tokumx.ft.cachetable.miss.partial.timeps

Fraction of time (microseconds/second) the database has had to wait for a disk read to complete for a partial cache miss.
fraction / None
Type: float
tokumx.ft.cachetable.miss.timeps

Fraction of time (microseconds/second) the database has had to wait for a disk read to complete for cache misses.
fraction / None
Type: float
tokumx.ft.cachetable.size.current

Total amount of uncompressed data currently in the database's internal cache.
byte / None
Type: float
tokumx.ft.cachetable.size.limit

Total amount of uncompressed data that will fit in TokuMX’s internal cache.
byte / None
Type: float
tokumx.ft.cachetable.size.writing

Total size of nodes that are currently queued up to be written to disk for eviction.
byte / None
Type: float
tokumx.ft.checkpoint.begin.timeps

Fraction of time (microseconds/second) that a begin checkpoint phase has spent blocking other threads.
fraction / None
Type: float
tokumx.ft.checkpoint.countps

Rate at which checkpoints are completed.
event / second
Type: float
tokumx.ft.checkpoint.lastComplete.time

The time spent, in seconds, by the most recently completed checkpoint.
second / None
Type: float
tokumx.ft.checkpoint.timeps

Fraction of time (seconds/second) spent doing checkpoints.
fraction / None
Type: float
tokumx.ft.checkpoint.write.leaf.bytes.compressedps

The rate at which leaf nodes are written to disk during checkpoints, after compression.
byte / second
Type: float
tokumx.ft.checkpoint.write.leaf.bytes.uncompressedps

The rate at which leaf nodes are written to disk during checkpoints, before compression.
byte / second
Type: float
tokumx.ft.checkpoint.write.leaf.countps

The rate at which leaf nodes are written to disk during checkpoints.
write / second
Type: float
tokumx.ft.checkpoint.write.leaf.timeps

The fraction of time spent writing leaf nodes to disk during checkpoints.
fraction / None
Type: float
tokumx.ft.checkpoint.write.nonleaf.bytes.compressedps

The rate at which nonleaf nodes are written to disk during checkpoints, after compression.
byte / second
Type: float
tokumx.ft.checkpoint.write.nonleaf.bytes.uncompressedps

The rate at which nonleaf nodes are written to disk during checkpoints, before compression.
byte / second
Type: float
tokumx.ft.checkpoint.write.nonleaf.countps

The rate at which nonleaf nodes are written to disk during checkpoints.
write / second
Type: float
tokumx.ft.checkpoint.write.nonleaf.timeps

The fraction of time spent writing nonleaf nodes to disk during checkpoints.
fraction / None
Type: float
tokumx.ft.compressionRatio.leaf

The size ratio of leaf nodes before and after compression.
fraction / None
Type: float
tokumx.ft.compressionRatio.nonleaf

The size ratio of nonleaf nodes before and after compression.
fraction / None
Type: float
tokumx.ft.compressionRatio.overall

The size ratio of nodes before and after compression.
fraction / None
Type: float
tokumx.ft.fsync.countps

The rate at which the database flushed the operating system’s file buffers to disk.
operation / second
Type: float
tokumx.ft.fsync.timeps

The fraction of time (microseconds/second) used to fsync to disk.
fraction / None
Type: float
tokumx.ft.locktree.size.current

Total memory the locktree is currently using.
byte / None
Type: float
tokumx.ft.locktree.size.limit

Maximum number of bytes that the locktree is allowed to use.
byte / None
Type: float
tokumx.ft.log.bytesps

The rate at which the logger writes to disk.
byte / second
Type: float
tokumx.ft.log.countps

The rate of individual log writes.
write / second
Type: float
tokumx.ft.log.timeps

The fraction of time spent performing log writes.
fraction / None
Type: float
tokumx.ft.serializeTime.leaf.compressps

Fraction of time spent compressing leaf nodes before writing them to disk (for checkpoint or when evicted while dirty).
fraction / None
Type: float
tokumx.ft.serializeTime.leaf.decompressps

Fraction of time spent decompressing leaf nodes after reading them off disk.
fraction / None
Type: float
tokumx.ft.serializeTime.leaf.deserializeps

Fraction of time spent deserializing leaf nodes and their partitions after reading them off disk.
fraction / None
Type: float
tokumx.ft.serializeTime.leaf.serializeps

Fraction of time spent serializing leaf nodes before writing them to disk (for checkpoint or when evicted while dirty).
fraction / None
Type: float
tokumx.ft.serializeTime.nonleaf.compressps

Fraction of time spent compressing nonleaf nodes before writing them to disk (for checkpoint or when evicted while dirty).
fraction / None
Type: float
tokumx.ft.serializeTime.nonleaf.decompressps

Fraction of time spent decompressing nonleaf nodes after reading them off disk.
fraction / None
Type: float
tokumx.ft.serializeTime.nonleaf.deserializeps

Fraction of time spent deserializing nonleaf nodes and their partitions after reading them off disk.
fraction / None
Type: float
tokumx.ft.serializeTime.nonleaf.serializeps

Fraction of time spent serializing nonleaf nodes before writing them to disk (for checkpoint or when evicted while dirty).
fraction / None
Type: float
tokumx.mem.resident

The amount of memory currently used by the database process.
mebibyte / None
Type: float
tokumx.mem.virtual

The amount of virtual memory used by the database process.
mebibyte / None
Type: float
tokumx.metrics.document.deletedps

The number of documents deleted per second.
document / second
Type: float
tokumx.metrics.document.insertedps

The number of documents inserted per second.
document / second
Type: float
tokumx.metrics.document.returnedps

The number of documents returned by queries per second.
document / second
Type: float
tokumx.metrics.document.updatedps

The number of documents updated per second.
document / second
Type: float
tokumx.metrics.getLastError.wtime.numps

The number of getLastError operations per second with a specified write concern (i.e. w) that wait for one or more members of a replica set to acknowledge the write operation.
operation / second
Type: float
tokumx.metrics.getLastError.wtime.totalMillisps

The fraction of time (ms/s) spent performing getLastError operations with write concern (i.e. w) that wait for one or more members of a replica set to acknowledge the write operation.
fraction / None
Type: float
tokumx.metrics.getLastError.wtimeoutsps

The number of times per second that write concern operations have timed out as a result of the wtimeout threshold to getLastError.
event / second
Type: float
tokumx.metrics.operation.idhackps

The rate of queries that contain the _id field.
query / second
Type: float
tokumx.metrics.operation.scanAndOrderps

The rate of queries that return sorted results but cannot use an index to perform the sort operation.
query / second
Type: float
tokumx.metrics.queryExecutor.scannedps

The rate of index items scanned during queries and query-plan evaluation.
operation / second
Type: float
tokumx.metrics.repl.apply.batches.numps

The number of batches applied across all databases per second.
operation / second
Type: float
tokumx.metrics.repl.apply.batches.totalMillisps

The fraction of time (ms/s) spent applying operations from the oplog.
fraction / None
Type: float
tokumx.metrics.repl.apply.opsps

The rate of oplog operations.
operation / second
Type: float
tokumx.metrics.repl.buffer.count

The number of operations in the oplog buffer.
operation / None
Type: float
tokumx.metrics.repl.buffer.sizeBytes

The current size of the contents of the oplog buffer.
byte / None
Type: float
tokumx.metrics.repl.network.bytesps

The rate at which data is read from the replication sync source.
byte / second
Type: float
tokumx.metrics.repl.network.getmores.numps

The rate of getmore operations.
operation / second
Type: float
tokumx.metrics.repl.network.getmores.totalMillisps

The fraction of time (ms/s) spent collecting data from getmore operations.
fraction / None
Type: float
tokumx.metrics.repl.network.opsps

The rate of operations read from the replication source.
operation / second
Type: float
tokumx.metrics.repl.network.readersCreatedps

The rate at which oplog query processes are created.
process / second
Type: float
tokumx.metrics.repl.oplog.insert.numps

The rate at which operations are inserted into the oplog.
operation / second
Type: float
tokumx.metrics.repl.oplog.insert.totalMillisps

The fraction of time (ms/s) spent inserting operations into the oplog.
fraction / None
Type: float
tokumx.metrics.repl.oplog.insertBytesps

The rate (in bytes) at which data is inserted into the oplog.
byte / second
Type: float
tokumx.metrics.ttl.deletedDocumentsps

The rate at which documents are deleted from collections with a TTL index.
document / second
Type: float
tokumx.metrics.ttl.passesps

The number of times per second the background process removes documents from collections with a TTL index.
event / second
Type: float
tokumx.opcounters.commandps

The total number of commands per second issued to the database.
command / second
Type: float
tokumx.opcounters.deleteps

The number of delete operations per second.
operation / second
Type: float
tokumx.opcounters.getmoreps

The number of getmore operations per second.
operation / second
Type: float
tokumx.opcounters.insertps

The number of insert operations per second.
operation / second
Type: float
tokumx.opcounters.queryps

The total number of queries per second.
query / second
Type: float
tokumx.opcounters.updateps

The number of update operations per second.
operation / second
Type: float
tokumx.opcountersRepl.commandps

The total number of replicated commands issued to the database per second.
command / second
Type: float
tokumx.opcountersRepl.deleteps

The number of replicated delete operations per second.
operation / second
Type: float
tokumx.opcountersRepl.getmoreps

The number of replicated getmore operations per second.
operation / second
Type: float
tokumx.opcountersRepl.insertps

The number of replicated insert operations per second.
operation / second
Type: float
tokumx.opcountersRepl.queryps

The total number of replicated queries per second.
query / second
Type: float
tokumx.opcountersRepl.updateps

The number of replicated update operations per second.
operation / second
Type: float
tokumx.stats.coll.count

The number of objects or documents in this collection.
document / None
Type: float
tokumx.stats.coll.nindexes

The number of indexes on this collection.
index / None
Type: float
tokumx.stats.coll.nindexesbeingbuilt

The number of indexes currently being built.
index / None
Type: float
tokumx.stats.coll.size

The total size in memory of all records in a collection. Does not include the record header, but does include the record’s padding. Does not include the size of any indexes associated with the collection.
byte / None
Type: float
tokumx.stats.coll.storageSize

The total amount of storage allocated to this collection for document storage.
byte / None
Type: float
tokumx.stats.coll.totalIndexSize

The total size of all indexes on this collection.
byte / None
Type: float
tokumx.stats.coll.totalIndexStorageSize

The total size on disk of all indexes on this collection (after compression).
byte / None
Type: float
tokumx.stats.dataSize

The total size of the data held in this database including the padding factor.
byte / None
Type: float
tokumx.stats.db.avgObjSize

The average size of each document.
byte / None
Type: float
tokumx.stats.db.collections

The number of collections in the database.
None / None
Type: float
tokumx.stats.db.dataSize

The total size of the data held in this database including the padding factor.
byte / None
Type: float
tokumx.stats.db.indexSize

The total size of all indexes created on this database.
byte / None
Type: float
tokumx.stats.db.indexStorageSize

The total size on disk of all indexes created on this database (after compression).
byte / None
Type: float
tokumx.stats.db.indexes

The total number of indexes across all collections in the database.
index / None
Type: float
tokumx.stats.db.objects

The number of documents in the database across all collections.
document / None
Type: float
tokumx.stats.db.storageSize

The total amount of space allocated to collections in this database for document storage.
byte / None
Type: float
tokumx.stats.idx.avgObjSize

The average size of each index entry.
byte / None
Type: float
tokumx.stats.idx.count

The number of documents in this index.
index / None
Type: float
tokumx.stats.idx.deletes

The number of delete operations performed on this index.
operation / None
Type: float
tokumx.stats.idx.inserts

The number of insert operations performed on this index.
operation / None
Type: float
tokumx.stats.idx.nscanned

The number of index entries scanned for queries using this index.
index / None
Type: float
tokumx.stats.idx.nscannedObjects

The number of collection objects examined after scanning an index entry for a query using this index.
object / None
Type: float
tokumx.stats.idx.queries

The number of query operations performed using this index.
query / None
Type: float
tokumx.stats.idx.size

The total size of this index.
byte / None
Type: float
tokumx.stats.idx.storageSize

The total size on disk of this index (after compression).
byte / None
Type: float
tokumx.stats.indexSize

The total size of all indexes created on this database.
byte / None
Type: float
tokumx.stats.indexes

The total number of indexes across all collections in the database.
index / None
Type: float
tokumx.stats.objects

The number of documents in the database across all collections.
document / None
Type: float
tokumx.stats.storageSize

The total amount of space allocated to collections in this database for document storage.
byte / None
Type: float
tokumx.uptime

The time that the TokuMX process has been active.
second / None
Type: float
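
Many of the metrics above are described as a fraction of time in microseconds/second, ms/s, or seconds/second. The following is a minimal sketch of how such a value could be turned into a percentage when reading graphs. It assumes the reported value is in the units named in the metric's description; the names UNITS_PER_SECOND and busy_percentage are hypothetical and not part of sd-agent.

```python
# Units of work time per one second of wall-clock time.
UNITS_PER_SECOND = {"us": 1_000_000, "ms": 1_000, "s": 1}


def busy_percentage(value: float, unit: str = "us") -> float:
    """Convert a 'time per second' metric value into a percentage of wall-clock time."""
    return 100.0 * value / UNITS_PER_SECOND[unit]


if __name__ == "__main__":
    # 250,000 microseconds of fsync time per second is a quarter of each second.
    print(busy_percentage(250_000, "us"))  # 25.0
    # 50 ms per second spent applying oplog batches.
    print(busy_percentage(50, "ms"))  # 5.0
```

For example, a tokumx.ft.fsync.timeps value of 250000 would correspond to 25% of each second spent in fsync, under the assumption stated above.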