Importing Data from Files

From Eclipsepedia

Latest revision as of 10:18, 22 March 2012

Overview

To import data into STEM, you must create a scenario. The external data is read by adding a special disease model or population model called an "External Data Source Playback" to your scenario. Follow the instructions for creating a scenario. Your scenario must contain a model and graph that contain the same set of regions named in your collection of data files. For example, if you want to play back data on US counties, you must add all the US counties to your model.

STEM allows you to import data from one or more external files and to play them back. You can either replay data for a population (e.g. daily population counts with births and deaths) or for a disease (e.g. daily counts for each disease compartment). Replaying data files that were generated by the STEM CSV logger is straightforward and gives you the flexibility of replaying data for any type of disease or population model. If you have data from other sources, you need to convert the data to the format STEM supports.

Importing data generated by the STEM CSV logger

When importing data that was generated by STEM, you create an instance of "External Data Source Playback" in either the wizard for creating a new disease model (if you want to import disease data) or the wizard for creating a new population model (if you want to import population data). For the "Data Path" parameter, enter the path to the folder containing both the "runparameters.csv" file and the "decorator.XMI" file, as in this screenshot:

STEM Importscenario0.jpg

The "decorator.XMI" file is important because it tells STEM how to interpret the rest of the logged .csv data files and what type of labels and label values to create in your graph to store the data. You can even import data for multiple populations at the same time, as is the case in the screenshot above (anopheles and humans).

Click Finish to create your new External Data Source Playback disease or population model, then drag it into your scenario models just as you would any other decorator.

Importing data from other sources

Before you can import data from other types of sources, you need to convert it to the CSV file format that STEM understands. The individual files or collections of files for a particular scenario should be grouped in a single folder. In Figure 1a we show a collection of files, all containing Shigella data, grouped in a folder called "ScenarioShigella". This folder can have any name and be placed in any location. In the example below we place the folder in the same location that STEM typically exports data to, namely:

  ... runtime-stem.product\<project>\Recorded Simulations\

In a default Eclipse installation, runtime-stem.product\ is typically located at C:\runtime-stem.product\.

To import disease model data, you need to create a CSV file for each state modeled by the disease. Each file name should be the disease state identifier (one of S, E, I or R) followed by an underscore character (_) followed by a number. The number indicates the geographical resolution of the locations contained in the file. For instance, 2 represents locations at administrative level 2 (US counties). In Figure 1a, we show several files containing data for the S, E, I and R states at administrative level 3. The file extension should be .csv.
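The naming rule above can be checked before you hand a folder to STEM. The following is a minimal sketch, not STEM's own code: the function name `parse_state_file_name` and the `StateFile` type are hypothetical, and the pattern assumes that file names are case-sensitive, which this page does not state.

```python
import re
from typing import NamedTuple, Optional

# Pattern derived from the rule in the text: one of S, E, I or R,
# an underscore, a number, and the .csv extension.
# Case sensitivity is an assumption made for this sketch.
_STATE_FILE = re.compile(r"([SEIR])_(\d+)\.csv")


class StateFile(NamedTuple):
    state: str        # disease state identifier, e.g. "S"
    admin_level: int  # geographical resolution, e.g. 2 for US counties


def parse_state_file_name(name: str) -> Optional[StateFile]:
    """Return the state and level encoded in a file name, or None if it does not fit."""
    match = _STATE_FILE.fullmatch(name)
    if match is None:
        return None
    return StateFile(match.group(1), int(match.group(2)))
```

A name such as `S_2.csv` parses to state `S` at level 2, while names with another extension or an unknown state letter are rejected.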

Importscenario1a.jpg

Figure 1a: Organizing Your Data to Prepare to Import into STEM.

Data Format

Each CSV file should contain a header indicating the domain of data in each column. The first column should contain a sequential iteration or row number, and its header label must be 'iteration'. The second column should contain the time the data applies to (rows should be sequential in time), with the header label 'time'. Each remaining column header should be the unique STEM location ID that the data in the column belongs to; the values in each row are the count (individuals) at that location in the state represented by the file.

If the data is from syndromic surveillance, you are unlikely to have information on every possible state; but when importing, you must have files and file headers for each state that you want to handle. STEM determines the type of the disease by checking which files are available when importing data. For example, if STEM finds an S_2.csv and an I_2.csv file but no E_2.csv or R_2.csv file, it will assume the disease is of type SI.

Note that the data in each column must be the total count (individuals) and not the fraction in each state. The numbers may be floating point (allowing fractional people infected, for example).
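The way the disease type follows from the files that are present can be illustrated with a short sketch. This only mimics the rule described above and is not taken from the STEM source; the function name `infer_disease_type` is hypothetical.

```python
from typing import Iterable

# Canonical order of the disease states named in the text.
STATE_ORDER = "SEIR"


def infer_disease_type(file_names: Iterable[str], admin_level: int) -> str:
    """Concatenate the states for which a '<state>_<level>.csv' file exists."""
    present = set(file_names)
    return "".join(
        state for state in STATE_ORDER
        if f"{state}_{admin_level}.csv" in present
    )
```

With the files from the example, `infer_disease_type(["S_2.csv", "I_2.csv"], 2)` yields `"SI"`.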

Table 1: The CSV file must label the locations you plan to import into STEM (see text).

  iteration, time, US-VT-50013-67000, US-VT-50027-77500, US-VT-50021-16825, US-VT-50021-75925
  0, Thu Dec 13 18:32:58 PST 2007, 253175, 1.45, 1.45, 0.0
  1, Thu Dec 13 18:32:59 PST 2007, 253110, 3.34, 3.34, 0.0
  ...
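If you convert data from another source, it can help to check a file against the layout shown in Table 1 before importing it. The sketch below is an illustration only: `check_state_csv` is a hypothetical helper, and it checks just the rules stated on this page (the 'iteration' and 'time' headers, at least one location column, and numeric counts). It does not validate location IDs or the time format.

```python
import csv
import io
from typing import List


def check_state_csv(text: str) -> List[str]:
    """Return a list of problems found; an empty list means these checks passed."""
    problems: List[str] = []
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    rows = [row for row in reader if row]  # ignore blank lines
    if not rows:
        return ["file is empty"]

    header = [label.strip() for label in rows[0]]
    if header[:2] != ["iteration", "time"]:
        problems.append("first two headers must be 'iteration' and 'time'")
    if len(header) < 3:
        problems.append("no location ID columns found")

    for number, row in enumerate(rows[1:], start=1):
        if len(row) != len(header):
            problems.append(
                f"data row {number}: expected {len(header)} fields, got {len(row)}"
            )
            continue
        for value in row[2:]:
            try:
                float(value)  # counts may be integers or floating point
            except ValueError:
                problems.append(f"data row {number}: '{value.strip()}' is not a number")
    return problems
```

Calling `check_state_csv(open(path).read())` on a converted file returns an empty list when none of these checks fail.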

If your external data consist of population data (e.g. daily estimates of mosquito capacity for Thailand), your directory must contain at least a single file Count_1.csv with daily population counts. You can use the population data to drive a disease model (e.g. malaria). STEM assumes that when the population is growing it is due to births exclusively (no deaths) and when the population is shrinking it is due to deaths (no births). If you want to also import data on daily births and deaths for a population, include the files Births_1.csv and Deaths_1.csv in the same folder. STEM makes sure that the reported births and deaths agree with the total count reported and, if not, adjusts the births and deaths to ensure this is the case.
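The assumption about growth and shrinkage can be written down in a few lines. This is a sketch of the stated rule for the case where only counts are supplied, not STEM's implementation; the name `births_and_deaths` is hypothetical, and the adjustment STEM applies when Births and Deaths files are present is not covered here because this page does not describe it in detail.

```python
from typing import List, Tuple


def births_and_deaths(counts: List[float]) -> Tuple[List[float], List[float]]:
    """Split each step-to-step change in population into births or deaths.

    Growth is attributed entirely to births and shrinkage entirely
    to deaths, as described in the text.
    """
    births: List[float] = []
    deaths: List[float] = []
    for previous, current in zip(counts, counts[1:]):
        change = current - previous
        births.append(max(0.0, change))
        deaths.append(max(0.0, -change))
    return births, deaths
```

For the counts `[100.0, 110.0, 105.0]`, the first step is recorded as 10 births and no deaths, and the second as 5 deaths and no births.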

When you are ready to add the External Data Source Replay disease or population model to your scenario (under the model node), click on the icon for adding a new disease model (or new population model). Specify your project and give the model a name. Select External Data Source Replay as the model in the drop-down. You must then specify the location of your data file(s). You may use the selector button to the right of Data Path. A "Select Directory" dialog (Figure 1b) will appear, allowing you to select a directory that contains the data files you wish to play back.


Importscenario1b.jpg

Figure 1b: Creating an External Data Source Replay model for population data.


Once you have created a scenario set up to replay from an external file, you can also edit the data path using the Properties Editor. Go to your project, find the External Data Source Replay model you created under the "decorators" folder, and double-click on it in the Resource Set window (see Figure 2). The editor will open, allowing you to change the path used to read in data.

Importscenario2.jpg

Figure 2: Changing the Data Path in the Properties Editor.


Once you have completed setting up your scenario, save your work by pressing Ctrl+S. To replay your data, select the scenario you created in the STEM Project Explorer, right-click, and select Run. STEM will launch, load the locations you specified, and play back your data. It does this efficiently by streaming one row at a time from the file system, so you can import and play back data for large scenarios, e.g. global simulations.
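The idea of streaming one row at a time can be sketched with a generator. This illustrates the general technique only and is not how STEM is implemented internally; `stream_rows` is a hypothetical name, and the column layout is the one described under Data Format.

```python
import csv
from typing import Dict, Iterable, Iterator


def stream_rows(lines: Iterable[str]) -> Iterator[Dict[str, float]]:
    """Yield one {location ID: count} mapping per data row.

    Only the current row is held in memory, so the size of the
    file does not affect memory use.
    """
    reader = csv.reader(lines, skipinitialspace=True)
    header = next(reader, None)
    if header is None:
        return  # empty input: nothing to yield
    locations = [label.strip() for label in header[2:]]  # skip 'iteration' and 'time'
    for row in reader:
        if not row:
            continue  # skip blank lines
        yield {loc: float(value) for loc, value in zip(locations, row[2:])}
```

Passing an open file object, for example `for counts in stream_rows(open(path, newline=""))`, processes the file lazily because file objects are iterated line by line.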