Difference between revisions of "UDC/How is Usage Data Stored"

From Eclipsepedia

< UDC
Jump to: navigation, search
(String)
(Profile)
 
(5 intermediate revisions by one user not shown)
Line 7: Line 7:
 
==Profile==
 
==Profile==
  
A record is created in the profile table, ''usagedata_profile'' for each workspace we encounter. When the UDC uploads data, it provides us with two identifiers (both are UUIDs generated by the UDC) representing the user and the workspace. The ''user'' identifier effectively represents an individual computer; the ''workspace'' identifier represents an individual workspace. This distinction was deemed necessary in order to account for users who run multiple Eclipse workspaces.
+
A record is created in the profile table, ''usagedata_profile'' for each workspace we encounter. When the UDC uploads data, it provides us with two identifiers (both are UUIDs generated by the UDC) representing the user and the workspace. These identifiers are used to correlate the data. The ''user'' identifier effectively represents an individual computer; the ''workspace'' identifier represents an individual workspace. This distinction was deemed necessary in order to account for users who run multiple Eclipse workspaces. Note that the ids cannot by themselves be used to determine the identity of the user or any personal information.
  
 
A row is created in the profile table for each distinct userId/workspaceId pairing.
 
A row is created in the profile table for each distinct userId/workspaceId pairing.
Line 74: Line 74:
 
==Record==
 
==Record==
  
Multiple "record" tables are created and maintained, one for each month of gathered usage data. The table ''usagedata_record_monthly_200901'', for example, contains usage data uploaded in January 2009. We decided to separate this data into multiple tables in anticipation of having to deal with extremely large amounts of data (there are 127,680,562 records in the January 2009 table).
+
Multiple "record" tables are created and maintained, one for each month of gathered usage data. The table ''usagedata_record_monthly_200901'', for example, contains usage data '''uploaded''' in January 2009 (note that this will include data generated in the previous month). We decided to separate this data into multiple tables in anticipation of having to deal with extremely large amounts of data (there are 127,680,562 records in the January 2009 table).
  
 
This table is highly normalized. Most of the content of a usage data event record is textual. All content is stored in the "String" table (see below); only the id of the corresponding entry is stored in the record table itself.
 
This table is highly normalized. Most of the content of a usage data event record is textual. All content is stored in the "String" table (see below); only the id of the corresponding entry is stored in the record table itself.
Line 82: Line 82:
 
In order to determine the user from which an individual record was obtained, a ''Record'' table must be joined to the ''Upload'' table and then to the ''Profile'' table.
 
In order to determine the user from which an individual record was obtained, a ''Record'' table must be joined to the ''Upload'' table and then to the ''Profile'' table.
  
==Use Example==
+
===Example===
  
 
<pre>
 
<pre>
select p.userId, u.ccode, r. bundleId  
+
select  
 +
    p.userId, u.ccode, r. bundleId  
 
   from usagedata_record_monthly_200901 as r  
 
   from usagedata_record_monthly_200901 as r  
 
     join usagedata_upload as u on (r.uploadId=u.id)  
 
     join usagedata_upload as u on (r.uploadId=u.id)  
Line 94: Line 95:
 
+--------------------------------------+-------+----------+
 
+--------------------------------------+-------+----------+
 
| 0ac00d5e-7bf4-41e1-9d20-c6800f452fa9 | us    |      727 |  
 
| 0ac00d5e-7bf4-41e1-9d20-c6800f452fa9 | us    |      727 |  
+--------------------------------------+-------+----------+
+
+--------------------------------------+-------+----------+</pre>
mysql></pre>
+
 
+
  
 
===Table Description===
 
===Table Description===
Line 136: Line 135:
 
The String table, ''usagedata_string'', contains a single entry for each string representing in the ''what'', ''kind'', ''bundleId'', ''bundleVersion'', and ''description'' fields of a ''Record'' table.
 
The String table, ''usagedata_string'', contains a single entry for each string representing in the ''what'', ''kind'', ''bundleId'', ''bundleVersion'', and ''description'' fields of a ''Record'' table.
  
==Example==
+
===Example===
 
You can use a join to denormalize the data. For example:
 
You can use a join to denormalize the data. For example:
  

Latest revision as of 12:42, 15 October 2010

Contents

[edit] Overview

Usage data is aggressively normalized on the server. Ultimately, the data is stored in multiple tables.

[edit] Tables

[edit] Profile

A record is created in the profile table, usagedata_profile for each workspace we encounter. When the UDC uploads data, it provides us with two identifiers (both are UUIDs generated by the UDC) representing the user and the workspace. These identifiers are used to correlate the data. The user identifier effectively represents an individual computer; the workspace identifier represents an individual workspace. This distinction was deemed necessary in order to account for users who run multiple Eclipse workspaces. Note that the ids cannot by themselves be used to determine the identity of the user or any personal information.

A row is created in the profile table for each distinct userId/workspaceId pairing.

[edit] Table Description

The table contains an id field which is the primary key. This field is used as the target of the foreign-key reference from the Upload table.

+-------------+------------------+------+-----+---------+----------------+
| Field       | Type             | Null | Key | Default | Extra          |
+-------------+------------------+------+-----+---------+----------------+
| id          | int(10) unsigned | NO   | PRI | NULL    | auto_increment | 
| userId      | char(40)         | NO   | MUL |         |                | 
| workspaceId | char(40)         | NO   | MUL |         |                | 
+-------------+------------------+------+-----+---------+----------------+

[edit] Sample Data

+----+--------------------------------------+--------------------------------------+
| id | userId                               | workspaceId                          |
+----+--------------------------------------+--------------------------------------+
|  1 | bbb6e4b1-391f-4da5-8ce1-a89a19cdb5ff | 7b38a3b8-a40f-4250-8be2-aa78c1e3ba9e | 
|  2 | bbb6e4b1-391f-4da5-8ce1-a89a19cdb5ff | a680e38b-aa01-4ac7-a11c-8bf457099570 | 
|  3 | bbb6e4b1-391f-4da5-8ce1-a89a19cdb5ff | 8f093d31-ba89-4e54-be98-8c7bc936587f | 
|  4 | 9d4a44d0-302b-4a00-aede-719e975ecfc8 | d4b5add9-cbc8-43cb-a6f8-696bac623429 | 
|  5 | 518e2328-faea-4f07-a46e-c0c97e050171 | 033ae773-8c3e-4240-8423-253d48be6780 | 
|  6 | 02a0ab5b-d51d-4905-ad8d-a66626da2774 | 386208d2-28a5-4ffc-a290-fdd476748857 | 
|  7 | 43d7c638-6f5a-463c-91ca-f0f7da905d4c | 7ea57f59-7f48-42c3-b563-7a1a17f22ab4 | 
|  8 | 77882409-c2c3-4d1a-9117-308fc123af70 | f369e621-2c0e-4dc9-b9c7-44208d6fc6b9 | 
|  9 | e3fe3f74-9510-41f7-a437-2997dc607b0c | 47a7e9f7-dc94-49dd-a344-92c4b321492b | 
| 10 | 04617774-515b-4c72-8327-9733bd52ebce | 9c1f5d31-cd1c-4e48-bd47-0689b58b6467 | 
+----+--------------------------------------+--------------------------------------+

[edit] Upload

Each entry in the upload table, usagedata_upload, represents an upload event. That is, every time the UDC "calls home" with an upload, we add a row to this table. Each row records the profileId of the user (from the Profile table using the userId/workspaceId combination provided with the upload event), the country code (ccode), and the time (on the server) of the upload.

[edit] Table Description

The table contains an id field which is the primary key. This field is used as the target of the foreign-key reference from a Record table.

+-----------+------------------+------+-----+-------------------+----------------+
| Field     | Type             | Null | Key | Default           | Extra          |
+-----------+------------------+------+-----+-------------------+----------------+
| id        | int(10) unsigned | NO   | PRI | NULL              | auto_increment | 
| profileId | int(10) unsigned | NO   |     |                   |                | 
| ccode     | char(2)          | YES  |     | NULL              |                | 
| time      | timestamp        | NO   | MUL | CURRENT_TIMESTAMP |                | 
+-----------+------------------+------+-----+-------------------+----------------+

[edit] Sample Data

+----+-----------+-------+---------------------+
| id | profileId | ccode | time                |
+----+-----------+-------+---------------------+
|  1 |         2 | ca    | 2008-04-04 13:17:07 | 
|  2 |         2 | ca    | 2008-04-04 15:24:18 | 
|  3 |         3 | ca    | 2008-04-04 15:26:45 | 
|  4 |         3 | ca    | 2008-04-04 15:29:37 | 
|  5 |         3 | ca    | 2008-04-04 15:30:20 | 
|  6 |         4 | us    | 2008-04-04 16:04:19 | 
|  7 |         5 | be    | 2008-04-04 16:08:36 | 
|  8 |         6 | us    | 2008-04-04 16:30:33 | 
|  9 |         7 | ca    | 2008-04-04 16:35:54 | 
| 10 |         8 | it    | 2008-04-04 16:41:02 | 
+----+-----------+-------+---------------------+

[edit] Record

Multiple "record" tables are created and maintained, one for each month of gathered usage data. The table usagedata_record_monthly_200901, for example, contains usage data uploaded in January 2009 (note that this will include data generated in the previous month). We decided to separate this data into multiple tables in anticipation of having to deal with extremely large amounts of data (there are 127,680,562 records in the January 2009 table).

This table is highly normalized. Most of the content of a usage data event record is textual. All content is stored in the "String" table (see below); only the id of the corresponding entry is stored in the record table itself.

The record table records the uploadId of the upload from which the a row originates, along with the what, kind, bundleId, bundleVersion, description, and time fields (described in What Gets Captured).

In order to determine the user from which an individual record was obtained, a Record table must be joined to the Upload table and then to the Profile table.

[edit] Example

select 
    p.userId, u.ccode, r. bundleId 
  from usagedata_record_monthly_200901 as r 
    join usagedata_upload as u on (r.uploadId=u.id) 
    join usagedata_profile as p on (u.profileId=p.id) 
  limit 1;
+--------------------------------------+-------+----------+
| userId                               | ccode | bundleId |
+--------------------------------------+-------+----------+
| 0ac00d5e-7bf4-41e1-9d20-c6800f452fa9 | us    |      727 | 
+--------------------------------------+-------+----------+

[edit] Table Description

The what, kind, bundleId, bundleVersion, and description fields are foreign key references into the String table (see below).

+---------------+---------------------+------+-----+---------+----------------+
| Field         | Type                | Null | Key | Default | Extra          |
+---------------+---------------------+------+-----+---------+----------------+
| id            | int(10) unsigned    | NO   | PRI | NULL    | auto_increment | 
| uploadId      | int(10) unsigned    | NO   | MUL |         |                | 
| what          | int(10) unsigned    | NO   |     |         |                | 
| kind          | int(10) unsigned    | NO   | MUL |         |                | 
| bundleId      | int(10) unsigned    | YES  | MUL | NULL    |                | 
| bundleVersion | int(10) unsigned    | YES  |     | NULL    |                | 
| description   | int(10) unsigned    | YES  |     | NULL    |                | 
| time          | bigint(20) unsigned | NO   |     |         |                | 
+---------------+---------------------+------+-----+---------+----------------+

[edit] Sample Data

+----+----------+------+------+----------+---------------+-------------+---------------+
| id | uploadId | what | kind | bundleId | bundleVersion | description | time          |
+----+----------+------+------+----------+---------------+-------------+---------------+
|  1 |  1083789 |    6 |   72 |      727 |             1 |         761 | 1224176378264 | 
|  2 |  1083789 |    8 |    9 |       12 |         32232 |          12 | 1224176378265 | 
|  3 |  1083789 |    8 |    9 |     1681 |         10763 |        1681 | 1224176378265 | 
|  4 |  1083789 |    8 |    9 |       21 |         10765 |          21 | 1224176378265 | 
|  5 |  1083789 |    8 |    9 |       23 |         32233 |          23 | 1224176378265 | 
|  6 |  1083789 |    8 |    9 |       25 |         10766 |          25 | 1224176378265 | 
|  7 |  1083789 |    8 |    9 |       27 |         10766 |          27 | 1224176378266 | 
|  8 |  1083789 |    8 |    9 |       10 |         10767 |          10 | 1224176378266 | 
|  9 |  1083789 |    8 |    9 |       29 |          8821 |          29 | 1224176378266 | 
| 10 |  1083789 |    8 |    9 |       31 |         10768 |          31 | 1224176378266 | 
+----+----------+------+------+----------+---------------+-------------+---------------+

[edit] String

The String table, usagedata_string, contains a single entry for each string representing in the what, kind, bundleId, bundleVersion, and description fields of a Record table.

[edit] Example

You can use a join to denormalize the data. For example:

select 
    r.id, w.string as what, k.string as kind, 
    bi.string as bundleId, bv.string as bundleVersion, 
    d.string as description, from_unixtime(r.time/1000) as time 
  from 
    usagedata_record_monthly_200901 as r 
    join usagedata_string as w on (r.what=w.id) 
    join usagedata_string as k on (r.kind=k.id) 
    join usagedata_string as bi on (r.bundleId=bi.id) 
    join usagedata_string as bv on (r.bundleVersion=bv.id) join usagedata_string as d on (r.description=d.id) 
  limit 10;
+----+-----------+-------------+----------------------------------------+---------------------------+----------------------------------------+---------------------+
| id | what      | kind        | bundleId                               | bundleVersion             | description                            | time                |
+----+-----------+-------------+----------------------------------------+---------------------------+----------------------------------------+---------------------+
|  1 | activated | perspective | org.eclipse.jst.j2ee.ui                |                           | org.eclipse.jst.j2ee.J2EEPerspective   | 2008-10-16 12:59:38 | 
|  2 | started   | bundle      | org.eclipse.osgi                       | 3.4.2.R34x_v20080826-1230 | org.eclipse.osgi                       | 2008-10-16 12:59:38 | 
|  3 | started   | bundle      | org.eclipse.equinox.simpleconfigurator | 1.0.0.v20080604           | org.eclipse.equinox.simpleconfigurator | 2008-10-16 12:59:38 | 
|  4 | started   | bundle      | org.eclipse.core.contenttype           | 3.3.0.v20080604-1400      | org.eclipse.core.contenttype           | 2008-10-16 12:59:38 | 
|  5 | started   | bundle      | org.eclipse.core.databinding           | 1.1.1.M20080827-0800b     | org.eclipse.core.databinding           | 2008-10-16 12:59:38 | 
|  6 | started   | bundle      | org.eclipse.core.expressions           | 3.4.0.v20080603-2000      | org.eclipse.core.expressions           | 2008-10-16 12:59:38 | 
|  7 | started   | bundle      | org.eclipse.core.filebuffers           | 3.4.0.v20080603-2000      | org.eclipse.core.filebuffers           | 2008-10-16 12:59:38 | 
|  8 | started   | bundle      | org.eclipse.core.filesystem            | 1.2.0.v20080604-1400      | org.eclipse.core.filesystem            | 2008-10-16 12:59:38 | 
|  9 | started   | bundle      | org.eclipse.core.jobs                  | 3.4.0.v20080512           | org.eclipse.core.jobs                  | 2008-10-16 12:59:38 | 
| 10 | started   | bundle      | org.eclipse.core.net                   | 1.1.0.I20080604           | org.eclipse.core.net                   | 2008-10-16 12:59:38 | 
+----+-----------+-------------+----------------------------------------+---------------------------+----------------------------------------+---------------------+


[edit] Table Description

New strings are added to this table as they are encountered. All entries in this table should be unique.

+--------+------------------+------+-----+---------+----------------+
| Field  | Type             | Null | Key | Default | Extra          |
+--------+------------------+------+-----+---------+----------------+
| id     | int(10) unsigned | NO   | PRI | NULL    | auto_increment | 
| string | varchar(256)     | NO   | MUL |         |                | 
+--------+------------------+------+-----+---------+----------------+

[edit] Sample Data

+----+-----------------------------+
| id | string                      |
+----+-----------------------------+
|  1 |                             | 
|  2 | deactivated                 | 
|  3 | workbench                   | 
|  4 | org.eclipse.ui.workbench    | 
|  5 | 3.4.0.I20080205-0010        | 
|  6 | activated                   | 
|  7 | closed                      | 
|  8 | started                     | 
|  9 | bundle                      | 
| 10 | org.eclipse.core.filesystem | 
+----+-----------------------------+