Manual:Modules/TextToDatabase

From MSure

Jump to: navigation, search

TextToDatabase is a module that, when an event is received, reads one or more lines from one or more files and, based on the configuration, inserts the data into a database table. Together with the module FileWatcher this module can be used to transfer data from a third-party applications' log file into a database for further use.

Contents

Log file schemes

There are three main schemes used by applications when logging data; a static file, rolling files, and date-sorted files.

  • A static file is a file which keeps the same name and grows over time as new data is appended to it.
  • Rolling files are files which has a common base name with a suffix that changes; usually a sequence number or a date or time string.
  • Date-sorted files are a mix of the two previous variants; they have a static name, but are sorted in folders by date.

This module supports all three types with the help of the FileWatcher module.

Configuration steps

The steps to configure TextToDatabase are as follows:

  • Determine which of the three creation schemes the third-party application uses.
  • Determine the format of the file.
  • Configure an instance of the FileWatcher module to monitor the file, or files, according to the creation scheme.
  • Configure the TextToDatabase module to read the file(s) according to a specified set of rules which are based on the format of the log files and the creation scheme.

Configuration template

<ModuleConfig xmlns:xi="http://www.w3.org/2001/XInclude" xml:base=".">
   <CoreConfig>
      <Alias>ChangeToYourOwnAlias!</Alias>
      <ClassName>TextToDatabase</ClassName>     
   </CoreConfig>
   <TextToDatabase>
      <TextToDatabaseItem UseInsertUpdate="true" LineCount="10">
         <DatabaseConnection>StatlinkDBConnection</DatabaseConnection>
         <DatabaseTable>
                  <Name>{name of table, or format string}</Name>
                  <TableDefinition>
                           <Column Name="datetime" Type="DateTime" AutoIncrement="false" AllowNull="false" PrimaryKey="true"/>
                           <Column Name="value" Type="Double" AutoIncrement="false" AllowNull="false" PrimaryKey="false"/>
                  </TableDefinition>
         </DatabaseTable>
         <FilePath>path to file, or directory</FilePath>
         <LineFormat><![CDATA[formatStringGoesHere]]></LineFormat>
         <PathFormat><![CDATA[formatStringGoesHere]]></PathFormat>
      </TextToDatabaseItem>
   </TextToDatabase>
</ModuleConfig>

Configuration elements

The FileWatcherItem element supports two mandatory attributes.

Attribute namePossible valuesDescription
LineCount1...1000Tells the module how many rows it should read at a time
UseInsertUpdatetrue or falseIf true, then instead of considering it an error when a duplicate key is encountered in the database,

update the already existing values with those currently contained in the executing statement.

  • DatabaseConnection element is mandatory. It specifies which database connection to use.
  • DatabaseTable element specifies into which database table the values read from the file should be placed. Using the same format strings as in LineFormat and PathFormat, it is possible to extract parts of the file path and/or data to be used as the table name. The column element must be named “databasetable” in any format string that will make up the table name. The data to match against is the entire file path with an appended ';' and the data itself. See Retrieving parts of the path and file name. Also see section “Format strings”. NOTE: The DatabaseTable name is parsed once each batch of lines, so if you have a table name that may change on each line, you must set the LineCount-attribute to 1.
Note: If the value of this setting contains a '{' and '}' it is considered to be a format string and not a simple table name.
  • FilePath element specifies the path and file name of the file to read. This setting is case insensitive. By using the normal wild card characters '*' (matches any number of characters) and '?' (matches a single character), it is possible to match several path- and file names.
    • Example: c:\path\2*\*.txt
      • Matches any txt-file within a directory that starts with a '2' and resides in the directory c:\path.
    • c:\path\2*\???.txt
      • Would match the same directories, but only a file with three characters preceding the extension.
  • LineFormat element specifies the format string to use when parsing the lines read from the file. See section “Format strings” below.
  • PathFormat element specifies the format string to use when retrieving data from the file path and file name. This setting is optional and may be left out. See section “Format strings” below.

Format strings

When reading a line from a file the module must know the format of the line, and into which field of the database table to place the data. This is what the format strings are for. This module comes with a set of predefined format strings, as shown in the table below. Note that the curly braces are a part of the format string and shall not be removed.

For each format string, a column name may optionally be specified to tell the module where to insert the data, as indicated below with the "<col>" string. See "Merging fields" below for details about the <ord> (order)-tag. The third tag, <ins>, allows you to insert some extra characters at the end of the data read from the file (or file path).

If any of the column name, order or insert tags are omitted, the characters "<" and ">" must still be included in the format string. There is never a space between the elements in a format string. All format strings are case sensitive and the format string must match the entire line that it will be matched against; add leading and ending patterns as required.

Special XML considerations

Due to the fact that the format strings uses characters reserved for XML, the format strings must be enclosed within a CDATA structure:

<element><![CDATA[Your format string goes here]]></element>

Example

<LineFormat><![CDATA[{HH:MM:SS<datetime><3><>}{char=','rep=''<><><>}{float<value><><>}]]></LineFormat>

Merging fields

It is possible to merge two or more fields into a single column in the database. To do this, the same column shall be specified for the relevant format strings. The module will concatenate the values as strings; "2008-01-01" and "20:00:00" will result in the string "2008-01-01 20:00:00" (separated by a space character specified in the format string). This is especially useful when using columns of type "datetime" in the database table. Normally fields with the same column name are merged left-to-right, but if this is not sufficient, the fields can be reordered using the <ord> (order)-tag. Specify a number in the tag; they will be merged from lowest to highest number.

Format strings

Format stringExampleDescription
{YYYY-MM-DD<col><ord><ins>}2008-02-02Matches a date string, separated by the character '-'
{YYYY/MM/DD<col><ord><ins>}2008/02/02Matches a date string, separated by the character '/'
{HH:MM:SS<col><ord><ins>}12:01:45Matches a time string, separated by the character ':'
{string[n:m]<col><ord><ins>}A string valueMatches a string literal, with a length of n to m characters, inclusive. n ≥ 0, m ≥ n.

n and m can be omitted to match a string of unspecified length. If m is omitted, this pattern will match as many characters as possible, until the next pattern is found.

{integer<col><ord><ins>}854325Matches a signed 32-bit integer value
{long<col><ord><ins>}455558187458Matches a signed 64-bit integer value
{float<col><ord><ins>}-13,5Matches a 32-bit floating point value. Either a ',' or a '.' can be used as a comma character.
{char='c'rep=''<col><ord><ins>}<any character>Matches a single character, optionally replacing it with another.
{inserttime<col><ord><ins>}When no time is available, this format string can be used to insert the current time.
{insertdate<col><ord><ins>}When no date is available, this format string can be used to insert the current date.
{insertdatetime<col><ord><ins>When no time or date are available, this format string can be used to insert the current date and time.

Example

Given the string "2008-01-09;23:30:00;abcdef;288;22.48", the format line would read:(written on a single line):

{YYYY-MM-DD<mydate><>< >}{char=';'rep=<><><>}{HH:MM:SS<mydate>}{string[:]<><><>}
 {char='; 'rep=<><><>}{integer<value_column><><>}{char=';'rep=<><><>}{float<column><><>}

This would merge the date and time into the "mydate" column (with a space in between), and place value 288 in column "value" and value 22.48 in the column "other value". The separators and the string "abcdef" would be ignored.

{insertdate<datetime><0>< >}{HH:MM:SS<datetime><1><>}{string[1:1]<><><>}{float<value><><>}

This would merge the time present in the data ("12:00:13 90.9") with the current date into column "datetime"

Retrieving parts of the path and file name

By specifying a PathFormat it is possible to merge parts of the file path and file name with the data read from the file itself. The same format strings are used. If PathFormat is specified, the complete file path and the line read from the file are concatenated (with a separating “;”), path first, creating a single string to be parsed.

Example

For the sake of this example we assume that the day and month are switched in the path, so given the path "c:\path\2008\03\02\temperature.log" (date is actually 2008-02-03), and a line similar to the one in the previous example (without the date) would give the following string: "c:\path\2008\03\02\temperature.log;23:30:00;abcdef;288;22.48" Merging the date from the path and the time from the file would give us the same result as in the previous example. The format stings would look like this (written on a single line):

PathFormat
{string[:]<><><>}{integer<mydate><0><->}{char='\'rep=''<><><>}{integer<mydate><3><>}
 {char='\'rep=’’<><><>}{integer<mydate><2><>}{string[:]<><>}{char='; 'rep=’’<><><>}
LineFormat
{HH:MM:SS<mydate><5><>}{string[:]<><><>}{char=';'rep=''<><><>}{integer<value>}{char=';'rep=''<><><>}{float<column><><>}

Note how the month and day format strings are numbered; the format string for the month has a lower order number than the one for the day. Also note that the format string for the time in the LineFormat has an order number of five, placing it after the date. Further, note the character expression at the end of the PathFormat.

Splitting a line into multiple tables and/or databases

It is possible to split a line into multiple databases. To do this, an instance of this module for each split must be configured using an alternative alias. Then for each of these aliases a set of database connection details must be specified in the configuration of the DatabaseConnector module. To split a line into multiple tables within the same database is easier; just specify a new section with the same settings, except for the DatabaseTable and LineFormat (unless you want to use duplicate the same data to a second table)

Mixing format strings with real regular expressions

It is possible to mix format strings with real regular expressions. This is not required, but can make the expressions less complex. Note that it is not possible to use the column, order, or insert tags in format strings that are mixed with real regular expressions. The resulting behavior of doing so is undefined.

An example of when a mix can be used is when extracting the table name from the file path.

Instead of writing this long expression

{string[:]<><><>}{char='\'rep=''<><><>}{integer<><><>}{char='\'rep=''<><><>}{integer<><><>}{char='\'rep=''<><><>}{integer<><><>}
 {char='\'rep=''<><><>}{string[:]<databasetable><><>}{char='.'rep=''<><><>}{string[:]<><><>}

this one can be used instead:

({string[:]<><><>}{char='\'rep=''<><><>}){1,}{string[:]<databasetable><><>}{char='.'rep=''<><><>}{string[:]<><><>}

What's been done is that we've told the module that "a string of unknown length ending with a '\' character" will appear one or more times (written as {1,}) at the start of the string followed by a string, a '.' character and another string of unknown length.

Taking it further, the final format string can look like:

(.{1,}\\{1}){1,}{string[:]<databasetable><><>}{char='.'rep=''<><><>}.{1,}

This manual does not explain regular expressions further, but there is a lot of information available on the Internet. A good resource is MSDN

Track files

For each file the module reads, it keeps a record of the last position by saving it to a track file, located in a sub directory named TrackFiles in the application's installation folder. The format of the file name is in close semblance to the monitored files’ full path, with an extension of .track. If these files are deleted, the module will reparse the entire file if it detects a file change, possibly resulting in errors when the data is to be inserted into the database so only delete those files which you know for certain won’t be changed. Track files are deleted automatically after three days.