{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "e915105d",
"metadata": {},
"source": [
"# Call and Message Data"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "935d0c5a",
"metadata": {},
"source": [
"## 1. Introduction\n",
"\n",
"In `niimpy`, communication data includes calls and SMS information. These data can reveal important information about people's circadian rhythm, social patterns, and activity, just to mention a few. Therefore, it is important to organize this information for further processing and analysis. To address this, `niimpy` includes a set of functions to clean, downsample, and extract features from communication data.\n",
"\n",
"A communication data dataframe should include the following columns (column names can be different, but in that case they must be provided as parameters):\n",
"- `user`: Subject ID\n",
"\n",
"Required for calls:\n",
"- `call_duration`: The duration of a call\n",
"- `call_type`: Type of a call, \"incoming\", \"outgoing\" or \"missed\"\n",
"\n",
"Required for messages:\n",
"- `message_type`: Type of a message, \"incoming\" or \"outgoing\"\n",
"\n",
"\n",
"The available features are:\n",
"- `call_duration_total`: duration of incoming and outgoing calls\n",
"- `call_duration_mean`: mean duration of incoming and outgoing calls\n",
"- `call_duration_median`: median duration of incoming and outgoing calls\n",
"- `call_duration_std`: standard deviation of incoming and outgoing calls\n",
"- `call_count`: number of calls within a time window\n",
"- `call_outgoing_incoming_ratio`: number of outgoing calls divided by the number of incoming calls\n",
"- `sms_count`: count of incoming and outgoing text messages\n",
"- `extract_features_comms`: wrapper to extract several features at the same time\n",
"\n",
"In the following, we will analyze call logs provided by `niimpy` as an example to illustrate the use of niimpy's communication preprocessing functions."
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "ca2e11f7",
"metadata": {},
"source": [
"## 2. Read data\n",
"\n",
"Let's start by reading the example data provided in `niimpy`. These data have already been shaped in a format that meets the requirements of the data schema. Let's start by importing the needed modules. Firstly we will import the `niimpy` package and then we will import the module we will use (communication) and give it a short name for use convinience. "
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "ec3bfcce",
"metadata": {},
"outputs": [
],
"source": [
"import niimpy\n",
"import niimpy.preprocessing.communication as com \n",
"from niimpy import config\n",
"import pandas as pd\n",
"import warnings\n",
"warnings.filterwarnings(\"ignore\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "134eb3d1",
"metadata": {},
"source": [
"Now let's read the example data provided in `niimpy`. The example data is in `csv` format, so we need to use the `read_csv` function. When reading the data, we can specify the timezone where the data was collected. This will help us handle daylight saving times easier. We can specify the timezone with the argument **tz**. The output is a dataframe. We can also check the number of rows and columns in the dataframe."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "3721a0b5",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(38, 6)"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = niimpy.read_csv(config.MULTIUSER_AWARE_CALLS_PATH, tz='Europe/Helsinki')\n",
"data.shape"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "0524a3f5",
"metadata": {},
"source": [
"The data was succesfully read. We can see that there are 38 datapoints with 6 columns in the dataset. However, we do not know yet what the data really looks like, so let's have a quick look:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "732a945e",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" user | \n",
" device | \n",
" time | \n",
" call_type | \n",
" call_duration | \n",
" datetime | \n",
"
\n",
" \n",
" \n",
" \n",
" | 2020-01-09 02:08:03.895999908+02:00 | \n",
" jd9INuQ5BBlW | \n",
" 3p83yASkOb_B | \n",
" 1.578528e+09 | \n",
" incoming | \n",
" 1079 | \n",
" 2020-01-09 02:08:03.895999908+02:00 | \n",
"
\n",
" \n",
" | 2020-01-09 02:49:44.969000101+02:00 | \n",
" jd9INuQ5BBlW | \n",
" 3p83yASkOb_B | \n",
" 1.578531e+09 | \n",
" outgoing | \n",
" 174 | \n",
" 2020-01-09 02:49:44.969000101+02:00 | \n",
"
\n",
" \n",
" | 2020-01-09 02:22:57.168999910+02:00 | \n",
" jd9INuQ5BBlW | \n",
" 3p83yASkOb_B | \n",
" 1.578529e+09 | \n",
" outgoing | \n",
" 890 | \n",
" 2020-01-09 02:22:57.168999910+02:00 | \n",
"
\n",
" \n",
" | 2020-01-09 02:27:21.187000036+02:00 | \n",
" jd9INuQ5BBlW | \n",
" 3p83yASkOb_B | \n",
" 1.578530e+09 | \n",
" outgoing | \n",
" 1342 | \n",
" 2020-01-09 02:27:21.187000036+02:00 | \n",
"
\n",
" \n",
" | 2020-01-09 02:47:16.177000046+02:00 | \n",
" jd9INuQ5BBlW | \n",
" 3p83yASkOb_B | \n",
" 1.578531e+09 | \n",
" incoming | \n",
" 645 | \n",
" 2020-01-09 02:47:16.177000046+02:00 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" user device time \\\n",
"2020-01-09 02:08:03.895999908+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578528e+09 \n",
"2020-01-09 02:49:44.969000101+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578531e+09 \n",
"2020-01-09 02:22:57.168999910+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578529e+09 \n",
"2020-01-09 02:27:21.187000036+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578530e+09 \n",
"2020-01-09 02:47:16.177000046+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578531e+09 \n",
"\n",
" call_type call_duration \\\n",
"2020-01-09 02:08:03.895999908+02:00 incoming 1079 \n",
"2020-01-09 02:49:44.969000101+02:00 outgoing 174 \n",
"2020-01-09 02:22:57.168999910+02:00 outgoing 890 \n",
"2020-01-09 02:27:21.187000036+02:00 outgoing 1342 \n",
"2020-01-09 02:47:16.177000046+02:00 incoming 645 \n",
"\n",
" datetime \n",
"2020-01-09 02:08:03.895999908+02:00 2020-01-09 02:08:03.895999908+02:00 \n",
"2020-01-09 02:49:44.969000101+02:00 2020-01-09 02:49:44.969000101+02:00 \n",
"2020-01-09 02:22:57.168999910+02:00 2020-01-09 02:22:57.168999910+02:00 \n",
"2020-01-09 02:27:21.187000036+02:00 2020-01-09 02:27:21.187000036+02:00 \n",
"2020-01-09 02:47:16.177000046+02:00 2020-01-09 02:47:16.177000046+02:00 "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.head()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "ae0a00c4",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" user | \n",
" device | \n",
" time | \n",
" call_type | \n",
" call_duration | \n",
" datetime | \n",
"
\n",
" \n",
" \n",
" \n",
" | 2019-08-12 22:10:21.503999949+03:00 | \n",
" iGyXetHE3S8u | \n",
" Cq9vueHh3zVs | \n",
" 1.565637e+09 | \n",
" incoming | \n",
" 715 | \n",
" 2019-08-12 22:10:21.503999949+03:00 | \n",
"
\n",
" \n",
" | 2019-08-12 22:27:19.923000097+03:00 | \n",
" iGyXetHE3S8u | \n",
" Cq9vueHh3zVs | \n",
" 1.565638e+09 | \n",
" outgoing | \n",
" 225 | \n",
" 2019-08-12 22:27:19.923000097+03:00 | \n",
"
\n",
" \n",
" | 2019-08-13 07:01:00.960999966+03:00 | \n",
" iGyXetHE3S8u | \n",
" Cq9vueHh3zVs | \n",
" 1.565669e+09 | \n",
" outgoing | \n",
" 1231 | \n",
" 2019-08-13 07:01:00.960999966+03:00 | \n",
"
\n",
" \n",
" | 2019-08-13 07:28:27.657999992+03:00 | \n",
" iGyXetHE3S8u | \n",
" Cq9vueHh3zVs | \n",
" 1.565671e+09 | \n",
" incoming | \n",
" 591 | \n",
" 2019-08-13 07:28:27.657999992+03:00 | \n",
"
\n",
" \n",
" | 2019-08-13 07:21:26.436000109+03:00 | \n",
" iGyXetHE3S8u | \n",
" Cq9vueHh3zVs | \n",
" 1.565670e+09 | \n",
" outgoing | \n",
" 375 | \n",
" 2019-08-13 07:21:26.436000109+03:00 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" user device time \\\n",
"2019-08-12 22:10:21.503999949+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565637e+09 \n",
"2019-08-12 22:27:19.923000097+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565638e+09 \n",
"2019-08-13 07:01:00.960999966+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565669e+09 \n",
"2019-08-13 07:28:27.657999992+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565671e+09 \n",
"2019-08-13 07:21:26.436000109+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565670e+09 \n",
"\n",
" call_type call_duration \\\n",
"2019-08-12 22:10:21.503999949+03:00 incoming 715 \n",
"2019-08-12 22:27:19.923000097+03:00 outgoing 225 \n",
"2019-08-13 07:01:00.960999966+03:00 outgoing 1231 \n",
"2019-08-13 07:28:27.657999992+03:00 incoming 591 \n",
"2019-08-13 07:21:26.436000109+03:00 outgoing 375 \n",
"\n",
" datetime \n",
"2019-08-12 22:10:21.503999949+03:00 2019-08-12 22:10:21.503999949+03:00 \n",
"2019-08-12 22:27:19.923000097+03:00 2019-08-12 22:27:19.923000097+03:00 \n",
"2019-08-13 07:01:00.960999966+03:00 2019-08-13 07:01:00.960999966+03:00 \n",
"2019-08-13 07:28:27.657999992+03:00 2019-08-13 07:28:27.657999992+03:00 \n",
"2019-08-13 07:21:26.436000109+03:00 2019-08-13 07:21:26.436000109+03:00 "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.tail()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "94e5d49e",
"metadata": {},
"source": [
"By exploring the head and tail of the dataframe we can form an idea of its entirety. From the data, we can see that:\n",
"\n",
"- rows are observations, indexed by timestamps, i.e. each row represents a call that was received/done/missed at a given time and date\n",
"- columns are characteristics for each observation, for example, the user whose data we are analyzing\n",
"- there are at least two different users in the dataframe\n",
"- there are two main columns: `call_type` and `call_duration`. In this case, the `call_type` columns stores information about whether the call was *incoming*, *outgoing* or *missed*; and the `call_duration` contains the duration of the call in seconds\n",
"\n",
"In fact, we can check the first three elements for each user"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "7c24f199",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" user | \n",
" device | \n",
" time | \n",
" call_type | \n",
" call_duration | \n",
" datetime | \n",
"
\n",
" \n",
" \n",
" \n",
" | 2020-01-09 02:08:03.895999908+02:00 | \n",
" jd9INuQ5BBlW | \n",
" 3p83yASkOb_B | \n",
" 1.578528e+09 | \n",
" incoming | \n",
" 1079 | \n",
" 2020-01-09 02:08:03.895999908+02:00 | \n",
"
\n",
" \n",
" | 2020-01-09 02:49:44.969000101+02:00 | \n",
" jd9INuQ5BBlW | \n",
" 3p83yASkOb_B | \n",
" 1.578531e+09 | \n",
" outgoing | \n",
" 174 | \n",
" 2020-01-09 02:49:44.969000101+02:00 | \n",
"
\n",
" \n",
" | 2020-01-09 02:22:57.168999910+02:00 | \n",
" jd9INuQ5BBlW | \n",
" 3p83yASkOb_B | \n",
" 1.578529e+09 | \n",
" outgoing | \n",
" 890 | \n",
" 2020-01-09 02:22:57.168999910+02:00 | \n",
"
\n",
" \n",
" | 2019-08-08 22:32:25.256999969+03:00 | \n",
" iGyXetHE3S8u | \n",
" Cq9vueHh3zVs | \n",
" 1.565293e+09 | \n",
" incoming | \n",
" 1217 | \n",
" 2019-08-08 22:32:25.256999969+03:00 | \n",
"
\n",
" \n",
" | 2019-08-08 22:53:35.107000113+03:00 | \n",
" iGyXetHE3S8u | \n",
" Cq9vueHh3zVs | \n",
" 1.565294e+09 | \n",
" incoming | \n",
" 383 | \n",
" 2019-08-08 22:53:35.107000113+03:00 | \n",
"
\n",
" \n",
" | 2019-08-08 22:31:34.539999962+03:00 | \n",
" iGyXetHE3S8u | \n",
" Cq9vueHh3zVs | \n",
" 1.565293e+09 | \n",
" incoming | \n",
" 1142 | \n",
" 2019-08-08 22:31:34.539999962+03:00 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" user device time \\\n",
"2020-01-09 02:08:03.895999908+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578528e+09 \n",
"2020-01-09 02:49:44.969000101+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578531e+09 \n",
"2020-01-09 02:22:57.168999910+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578529e+09 \n",
"2019-08-08 22:32:25.256999969+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565293e+09 \n",
"2019-08-08 22:53:35.107000113+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565294e+09 \n",
"2019-08-08 22:31:34.539999962+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565293e+09 \n",
"\n",
" call_type call_duration \\\n",
"2020-01-09 02:08:03.895999908+02:00 incoming 1079 \n",
"2020-01-09 02:49:44.969000101+02:00 outgoing 174 \n",
"2020-01-09 02:22:57.168999910+02:00 outgoing 890 \n",
"2019-08-08 22:32:25.256999969+03:00 incoming 1217 \n",
"2019-08-08 22:53:35.107000113+03:00 incoming 383 \n",
"2019-08-08 22:31:34.539999962+03:00 incoming 1142 \n",
"\n",
" datetime \n",
"2020-01-09 02:08:03.895999908+02:00 2020-01-09 02:08:03.895999908+02:00 \n",
"2020-01-09 02:49:44.969000101+02:00 2020-01-09 02:49:44.969000101+02:00 \n",
"2020-01-09 02:22:57.168999910+02:00 2020-01-09 02:22:57.168999910+02:00 \n",
"2019-08-08 22:32:25.256999969+03:00 2019-08-08 22:32:25.256999969+03:00 \n",
"2019-08-08 22:53:35.107000113+03:00 2019-08-08 22:53:35.107000113+03:00 \n",
"2019-08-08 22:31:34.539999962+03:00 2019-08-08 22:31:34.539999962+03:00 "
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.drop_duplicates(['user','call_duration']).groupby('user').head(3)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "c33fe622",
"metadata": {},
"source": [
"Sometimes the data may come in a disordered manner, so just to make sure, let's order the dataframe and compare the results. We will use the columns \"user\" and \"datetime\" since we would like to order the information according to firstly, participants, and then, by time in order of happening. Luckily, in our dataframe, the index and datetime are the same."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "c1cd4baf",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" user | \n",
" device | \n",
" time | \n",
" call_type | \n",
" call_duration | \n",
" datetime | \n",
"
\n",
" \n",
" \n",
" \n",
" | 2019-08-08 22:31:34.539999962+03:00 | \n",
" iGyXetHE3S8u | \n",
" Cq9vueHh3zVs | \n",
" 1.565293e+09 | \n",
" incoming | \n",
" 1142 | \n",
" 2019-08-08 22:31:34.539999962+03:00 | \n",
"
\n",
" \n",
" | 2019-08-08 22:32:25.256999969+03:00 | \n",
" iGyXetHE3S8u | \n",
" Cq9vueHh3zVs | \n",
" 1.565293e+09 | \n",
" incoming | \n",
" 1217 | \n",
" 2019-08-08 22:32:25.256999969+03:00 | \n",
"
\n",
" \n",
" | 2019-08-08 22:43:45.834000111+03:00 | \n",
" iGyXetHE3S8u | \n",
" Cq9vueHh3zVs | \n",
" 1.565293e+09 | \n",
" incoming | \n",
" 1170 | \n",
" 2019-08-08 22:43:45.834000111+03:00 | \n",
"
\n",
" \n",
" | 2020-01-09 01:55:16.996000051+02:00 | \n",
" jd9INuQ5BBlW | \n",
" 3p83yASkOb_B | \n",
" 1.578528e+09 | \n",
" outgoing | \n",
" 1256 | \n",
" 2020-01-09 01:55:16.996000051+02:00 | \n",
"
\n",
" \n",
" | 2020-01-09 02:06:09.790999889+02:00 | \n",
" jd9INuQ5BBlW | \n",
" 3p83yASkOb_B | \n",
" 1.578528e+09 | \n",
" outgoing | \n",
" 271 | \n",
" 2020-01-09 02:06:09.790999889+02:00 | \n",
"
\n",
" \n",
" | 2020-01-09 02:08:03.895999908+02:00 | \n",
" jd9INuQ5BBlW | \n",
" 3p83yASkOb_B | \n",
" 1.578528e+09 | \n",
" incoming | \n",
" 1079 | \n",
" 2020-01-09 02:08:03.895999908+02:00 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" user device time \\\n",
"2019-08-08 22:31:34.539999962+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565293e+09 \n",
"2019-08-08 22:32:25.256999969+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565293e+09 \n",
"2019-08-08 22:43:45.834000111+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565293e+09 \n",
"2020-01-09 01:55:16.996000051+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578528e+09 \n",
"2020-01-09 02:06:09.790999889+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578528e+09 \n",
"2020-01-09 02:08:03.895999908+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578528e+09 \n",
"\n",
" call_type call_duration \\\n",
"2019-08-08 22:31:34.539999962+03:00 incoming 1142 \n",
"2019-08-08 22:32:25.256999969+03:00 incoming 1217 \n",
"2019-08-08 22:43:45.834000111+03:00 incoming 1170 \n",
"2020-01-09 01:55:16.996000051+02:00 outgoing 1256 \n",
"2020-01-09 02:06:09.790999889+02:00 outgoing 271 \n",
"2020-01-09 02:08:03.895999908+02:00 incoming 1079 \n",
"\n",
" datetime \n",
"2019-08-08 22:31:34.539999962+03:00 2019-08-08 22:31:34.539999962+03:00 \n",
"2019-08-08 22:32:25.256999969+03:00 2019-08-08 22:32:25.256999969+03:00 \n",
"2019-08-08 22:43:45.834000111+03:00 2019-08-08 22:43:45.834000111+03:00 \n",
"2020-01-09 01:55:16.996000051+02:00 2020-01-09 01:55:16.996000051+02:00 \n",
"2020-01-09 02:06:09.790999889+02:00 2020-01-09 02:06:09.790999889+02:00 \n",
"2020-01-09 02:08:03.895999908+02:00 2020-01-09 02:08:03.895999908+02:00 "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.sort_values(by=['user', 'datetime'], inplace=True)\n",
"data.drop_duplicates(['user','call_duration']).groupby('user').head(3)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "754a6ef2",
"metadata": {},
"source": [
"By comparing the last two dataframes, we can see that sorting the values was a good move. For example, in the unsorted dataframe, the earliest date for the user *iGyXetHE3S8u* was 2019-08-08 22:32:25; instead, for the sorted dataframe, the earliest date for the user *iGyXetHE3S8u* is 2019-08-08 22:31:34. Small differences, but still important."
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "8251870e",
"metadata": {},
"source": [
"## * TIP! Data format requirements (or what should our data look like)\n",
"\n",
"Data can take other shapes and formats. However, the `niimpy` data scheme requires it to be in a certain shape. This means the dataframe needs to have at least the following characteristics:\n",
"1. One row per call. Each row should store information about one call only\n",
"2. Each row's index should be a timestamp\n",
"3. There should be at least four columns: \n",
" - index: date and time when the event happened (timestamp)\n",
" - user: stores the user name whose data is analyzed. Each user should have a unique name or hash (i.e. one hash for each unique user)\n",
" - call_type: stores whether the call was incoming, outgoing, or missed. The exact words *incoming*, *outgoing*, and *missed* should be used\n",
" - call_duration: the duration of the call in seconds\n",
"4. Columns additional to those listed in item 3 are allowed\n",
"5. The names of the columns do not need to be exactly \"user\", \"call_type\" or \"call_duration\" as we can pass our own names in an argument (to be explained later).\n",
"\n",
"Below is an example of a dataframe that complies with these minimum requirements"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "7cdb7a0c",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" user | \n",
" call_type | \n",
" call_duration | \n",
"
\n",
" \n",
" \n",
" \n",
" | 2019-08-08 22:31:34.539999962+03:00 | \n",
" iGyXetHE3S8u | \n",
" incoming | \n",
" 1142 | \n",
"
\n",
" \n",
" | 2019-08-08 22:32:25.256999969+03:00 | \n",
" iGyXetHE3S8u | \n",
" incoming | \n",
" 1217 | \n",
"
\n",
" \n",
" | 2019-08-08 22:43:45.834000111+03:00 | \n",
" iGyXetHE3S8u | \n",
" incoming | \n",
" 1170 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" user call_type call_duration\n",
"2019-08-08 22:31:34.539999962+03:00 iGyXetHE3S8u incoming 1142\n",
"2019-08-08 22:32:25.256999969+03:00 iGyXetHE3S8u incoming 1217\n",
"2019-08-08 22:43:45.834000111+03:00 iGyXetHE3S8u incoming 1170"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"example_dataschema = data[['user','call_type','call_duration']]\n",
"example_dataschema.head(3)"
]
},
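{
"cell_type": "markdown",
"id": "b7d41c02",
"metadata": {},
"source": [
"The requirements above can also be met by a dataframe built from scratch. The following is a minimal sketch using only `pandas`; the user names, durations, and Unix timestamps are invented for illustration and are not part of the `niimpy` example data.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical call log: every value below is made up for illustration.\n",
"raw = pd.DataFrame({\n",
"    'user': ['user_a', 'user_a', 'user_b'],\n",
"    'call_type': ['incoming', 'missed', 'outgoing'],\n",
"    'call_duration': [120, 0, 45],  # seconds\n",
"    'time': [1565293000, 1565293600, 1578528000],  # Unix time in seconds\n",
"})\n",
"\n",
"# Build a timezone-aware DatetimeIndex from the Unix timestamps.\n",
"index = pd.to_datetime(raw['time'].to_numpy(), unit='s', utc=True)\n",
"raw.index = index.tz_convert('Europe/Helsinki')\n",
"\n",
"# Keep only the columns named in the requirements.\n",
"minimal = raw[['user', 'call_type', 'call_duration']]\n",
"\n",
"# Sanity checks against the requirements listed above.\n",
"assert isinstance(minimal.index, pd.DatetimeIndex)\n",
"assert set(minimal['call_type']) <= {'incoming', 'outgoing', 'missed'}\n",
"```\n",
"\n",
"A dataframe shaped like `minimal` has one row per call, a timestamp index, and the three expected columns, so it should be in the form described above."
]
},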
{
"attachments": {},
"cell_type": "markdown",
"id": "6fa61501",
"metadata": {},
"source": [
"## 4. Extracting features\n",
"There are two ways to extract features. We could use each function separately or we could use `niimpy`'s ready-made wrapper. Both ways will require us to specify arguments to pass to the functions/wrapper in order to customize the way the functions work. These arguments are specified in dictionaries. Let's first understand how to extract features using stand-alone functions.\n",
"\n",
"### 4.1 Extract features using stand-alone functions\n",
"We can use `niimpy`'s functions to compute communication features. Each function will require two inputs:\n",
"- (mandatory) dataframe that must comply with the minimum requirements (see '* TIP! Data requirements above)\n",
"- (optional) arguments for stand-alone functions\n",
"\n",
"#### 4.1.1 The argument dictionary for stand-alone functions (or how we specify the way a function works)\n",
"We can input two types of arguments to customize the way a stand-alone function works:\n",
"- the name of the columns to be preprocessed: Since the dataframe may have different columns, we need to specify which column has the data we would like to be preprocessed. To do so, we can simply pass the name of the column to the argument `communication_column_name`. \n",
"\n",
"- the way we resample: resampling options are specified in `niimpy` as a dictionary. `niimpy`'s resampling and aggregating relies on `pandas.DataFrame.resample`, so mastering the use of this pandas function will help us greatly in `niimpy`'s preprocessing. Please familiarize yourself with the pandas resample function before continuing. \n",
" Briefly, to use the `pandas.DataFrame.resample` function, we need a rule. This rule states the intervals we would like to use to resample our data (e.g., 15-seconds, 30-minutes, 1-hour). Neverthless, we can input more details into the function to specify the exact sampling we would like. For example, we could use the *close* argument if we would like to specify which side of the interval is closed, or we could use the *offset* argument if we would like to start our binning with an offset, etc. There are plenty of options to use this command, so we strongly recommend having `pandas.DataFrame.resample` documentation at hand. All features for the `pandas.DataFrame.resample` will be specified in a dictionary where keys are the arguments' names for the `pandas.DataFrame.resample`, and the dictionary's values are the values for each of these selected arguments. This dictionary will be passed as a value to the key `resample_args` in `niimpy`.\n",
"\n",
"Let's see some basic examples of these dictionaries:"
]
},
{
"cell_type": "markdown",
"id": "f9e2deba",
"metadata": {},
"source": [
"``` Python\n",
"com.call_duration_total(data, communication_column_name = \"call_duration\", resample_args = {\"rule\":\"1D\"})\n",
"com.call_duration_total(data, communication_column_name = \"random_name\", resample_args = {\"rule\":\"30T\"})\n",
"com.call_duration_total(data, communication_column_name = \"other_name\", resample_args = {\"rule\":\"45T\",\"origin\":\"end\"})\n",
"```"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "f5c98d97",
"metadata": {},
"source": [
"Here, we have basic feature function calls. \n",
"\n",
"- The first example will analyze the data stored in the column `call_duration` in our dataframe. The data will be binned in one day periods\n",
"- The second example will analyze the data stored in the column `random_name` in our dataframe. The data will be aggregated in 30-minutes bins\n",
"- The third example will analyze the data stored in the column `other_name` in our dataframe. The data will be binned in 45-minutes bins, but the binning will start from the last timestamp in the dataframe. \n",
"\n",
"**Default values:** if no arguments are passed, `niimpy`'s default values are \"call_duration\" for the communication_column_name, and 30-min aggregation bins. "
]
},
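{
"cell_type": "markdown",
"id": "c3e85a91",
"metadata": {},
"source": [
"Since `resample_args` mirrors the arguments of `pandas.DataFrame.resample`, it can help to try a dictionary directly in `pandas` first. The sketch below uses invented durations and unpacks the dictionary with `**`; that `niimpy` forwards the dictionary in exactly this way is an assumption, not something shown in this notebook.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical call durations (seconds) for a single user.\n",
"idx = pd.to_datetime(['2020-01-09 01:55', '2020-01-09 02:06', '2020-01-09 02:21'])\n",
"calls = pd.DataFrame({'call_duration': [1256, 271, 1489]}, index=idx)\n",
"\n",
"# Keys are argument names of DataFrame.resample, values are their settings.\n",
"resample_args = {'rule': '30min', 'closed': 'left'}\n",
"\n",
"# Sum the durations that fall into each 30-minute bin.\n",
"total_per_bin = calls['call_duration'].resample(**resample_args).sum()\n",
"print(total_per_bin)\n",
"```\n",
"\n",
"The first call lands in the bin starting at 01:30 and the other two in the bin starting at 02:00, so the two sums are 1256 and 1760 seconds."
]
},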
{
"cell_type": "markdown",
"id": "91e16683",
"metadata": {},
"source": [
"#### 4.1.2 Using the functions\n",
"Now that we understand how the functions are customized, it is time we compute our first communication feature. Suppose that we are interested in extracting the total duration of outgoing calls every 20 minutes. We will need `niimpy`'s `call_duration_total` function, the data, and we will also need to create a dictionary to customize our function. Let's create the dictionary first"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "dee409a3",
"metadata": {},
"outputs": [],
"source": [
"my_call_duration = com.call_duration_total(data, communication_column_name = 'call_duration', resample_args = {'rule':'20T'})"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "ef4e43ec",
"metadata": {},
"source": [
"Let's look at some values for one of the subjects."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "4efcce2f",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" missed_duration_total | \n",
" incoming_duration_total | \n",
" outgoing_duration_total | \n",
" user | \n",
" device | \n",
"
\n",
" \n",
" \n",
" \n",
" | 2020-01-09 01:40:00+02:00 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 1256.0 | \n",
" jd9INuQ5BBlW | \n",
" 3p83yASkOb_B | \n",
"
\n",
" \n",
" | 2020-01-09 02:00:00+02:00 | \n",
" 0.0 | \n",
" 1079.0 | \n",
" 2192.0 | \n",
" jd9INuQ5BBlW | \n",
" 3p83yASkOb_B | \n",
"
\n",
" \n",
" | 2020-01-09 02:20:00+02:00 | \n",
" 0.0 | \n",
" 4650.0 | \n",
" 3696.0 | \n",
" jd9INuQ5BBlW | \n",
" 3p83yASkOb_B | \n",
"
\n",
" \n",
" | 2020-01-09 02:40:00+02:00 | \n",
" 0.0 | \n",
" 645.0 | \n",
" 174.0 | \n",
" jd9INuQ5BBlW | \n",
" 3p83yASkOb_B | \n",
"
\n",
" \n",
" | 2020-01-09 03:00:00+02:00 | \n",
" 0.0 | \n",
" 269.0 | \n",
" 0.0 | \n",
" jd9INuQ5BBlW | \n",
" 3p83yASkOb_B | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" missed_duration_total incoming_duration_total \\\n",
"2020-01-09 01:40:00+02:00 0.0 0.0 \n",
"2020-01-09 02:00:00+02:00 0.0 1079.0 \n",
"2020-01-09 02:20:00+02:00 0.0 4650.0 \n",
"2020-01-09 02:40:00+02:00 0.0 645.0 \n",
"2020-01-09 03:00:00+02:00 0.0 269.0 \n",
"\n",
" outgoing_duration_total user device \n",
"2020-01-09 01:40:00+02:00 1256.0 jd9INuQ5BBlW 3p83yASkOb_B \n",
"2020-01-09 02:00:00+02:00 2192.0 jd9INuQ5BBlW 3p83yASkOb_B \n",
"2020-01-09 02:20:00+02:00 3696.0 jd9INuQ5BBlW 3p83yASkOb_B \n",
"2020-01-09 02:40:00+02:00 174.0 jd9INuQ5BBlW 3p83yASkOb_B \n",
"2020-01-09 03:00:00+02:00 0.0 jd9INuQ5BBlW 3p83yASkOb_B "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"my_call_duration[my_call_duration[\"user\"] == \"jd9INuQ5BBlW\"]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "906bb7c3",
"metadata": {},
"source": [
"Let's remember how the original data looked like for this subject"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "3fe1c875",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" user | \n",
" device | \n",
" time | \n",
" call_type | \n",
" call_duration | \n",
" datetime | \n",
"
\n",
" \n",
" \n",
" \n",
" | 2020-01-09 01:55:16.996000051+02:00 | \n",
" jd9INuQ5BBlW | \n",
" 3p83yASkOb_B | \n",
" 1.578528e+09 | \n",
" outgoing | \n",
" 1256 | \n",
" 2020-01-09 01:55:16.996000051+02:00 | \n",
"
\n",
" \n",
" | 2020-01-09 02:06:09.790999889+02:00 | \n",
" jd9INuQ5BBlW | \n",
" 3p83yASkOb_B | \n",
" 1.578528e+09 | \n",
" outgoing | \n",
" 271 | \n",
" 2020-01-09 02:06:09.790999889+02:00 | \n",
"
\n",
" \n",
" | 2020-01-09 02:08:03.895999908+02:00 | \n",
" jd9INuQ5BBlW | \n",
" 3p83yASkOb_B | \n",
" 1.578528e+09 | \n",
" incoming | \n",
" 1079 | \n",
" 2020-01-09 02:08:03.895999908+02:00 | \n",
"
\n",
" \n",
" | 2020-01-09 02:10:06.573999882+02:00 | \n",
" jd9INuQ5BBlW | \n",
" 3p83yASkOb_B | \n",
" 1.578529e+09 | \n",
" missed | \n",
" 0 | \n",
" 2020-01-09 02:10:06.573999882+02:00 | \n",
"
\n",
" \n",
" | 2020-01-09 02:11:37.648999929+02:00 | \n",
" jd9INuQ5BBlW | \n",
" 3p83yASkOb_B | \n",
" 1.578529e+09 | \n",
" outgoing | \n",
" 1070 | \n",
" 2020-01-09 02:11:37.648999929+02:00 | \n",
"
\n",
" \n",
" | 2020-01-09 02:12:31.164000034+02:00 | \n",
" jd9INuQ5BBlW | \n",
" 3p83yASkOb_B | \n",
" 1.578529e+09 | \n",
" outgoing | \n",
" 851 | \n",
" 2020-01-09 02:12:31.164000034+02:00 | \n",
"
\n",
" \n",
" | 2020-01-09 02:21:45.877000093+02:00 | \n",
" jd9INuQ5BBlW | \n",
" 3p83yASkOb_B | \n",
" 1.578529e+09 | \n",
" incoming | \n",
" 1489 | \n",
" 2020-01-09 02:21:45.877000093+02:00 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" user device time \\\n",
"2020-01-09 01:55:16.996000051+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578528e+09 \n",
"2020-01-09 02:06:09.790999889+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578528e+09 \n",
"2020-01-09 02:08:03.895999908+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578528e+09 \n",
"2020-01-09 02:10:06.573999882+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578529e+09 \n",
"2020-01-09 02:11:37.648999929+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578529e+09 \n",
"2020-01-09 02:12:31.164000034+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578529e+09 \n",
"2020-01-09 02:21:45.877000093+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578529e+09 \n",
"\n",
" call_type call_duration \\\n",
"2020-01-09 01:55:16.996000051+02:00 outgoing 1256 \n",
"2020-01-09 02:06:09.790999889+02:00 outgoing 271 \n",
"2020-01-09 02:08:03.895999908+02:00 incoming 1079 \n",
"2020-01-09 02:10:06.573999882+02:00 missed 0 \n",
"2020-01-09 02:11:37.648999929+02:00 outgoing 1070 \n",
"2020-01-09 02:12:31.164000034+02:00 outgoing 851 \n",
"2020-01-09 02:21:45.877000093+02:00 incoming 1489 \n",
"\n",
" datetime \n",
"2020-01-09 01:55:16.996000051+02:00 2020-01-09 01:55:16.996000051+02:00 \n",
"2020-01-09 02:06:09.790999889+02:00 2020-01-09 02:06:09.790999889+02:00 \n",
"2020-01-09 02:08:03.895999908+02:00 2020-01-09 02:08:03.895999908+02:00 \n",
"2020-01-09 02:10:06.573999882+02:00 2020-01-09 02:10:06.573999882+02:00 \n",
"2020-01-09 02:11:37.648999929+02:00 2020-01-09 02:11:37.648999929+02:00 \n",
"2020-01-09 02:12:31.164000034+02:00 2020-01-09 02:12:31.164000034+02:00 \n",
"2020-01-09 02:21:45.877000093+02:00 2020-01-09 02:21:45.877000093+02:00 "
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data[data[\"user\"]==\"jd9INuQ5BBlW\"].head(7)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "3739cde8",
"metadata": {},
"source": [
"We see that the bins are indeed 20-minutes bins, however, they are adjusted to fixed, predetermined intervals, i.e. the bin does not start on the time of the first datapoint. Instead, `pandas` starts the binning at 00:00:00 of everyday and counts 20-minutes intervals from there. \n",
"\n",
"If we want the binning to start from the first datapoint in our dataset, we need the origin parameter and a for loop."
]
},
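{
"cell_type": "markdown",
"id": "d9f26b47",
"metadata": {},
"source": [
"The effect of the origin on bin edges can be seen with `pandas` alone. The following is a minimal sketch on invented data; it uses `origin='start'`, a `pandas.DataFrame.resample` option that counts bins from the first timestamp of the series, rather than an explicit timestamp.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical call durations (seconds) for a single user.\n",
"idx = pd.to_datetime(['2020-01-09 01:55:17', '2020-01-09 02:06:10', '2020-01-09 02:21:46'])\n",
"durations = pd.Series([1256, 271, 1489], index=idx)\n",
"\n",
"# Default origin ('start_day'): bin edges are counted from midnight.\n",
"from_midnight = durations.resample('20min').sum()\n",
"\n",
"# origin='start': bin edges are counted from the first timestamp.\n",
"from_first = durations.resample('20min', origin='start').sum()\n",
"\n",
"print(from_midnight.index[0])  # first bin edge aligned to the clock\n",
"print(from_first.index[0])     # first bin edge equals the first call\n",
"```\n",
"\n",
"Here the same three calls fall into three bins when counted from midnight and into two bins when counted from the first call."
]
},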
{
"cell_type": "code",
"execution_count": 11,
"id": "af78c1d1",
"metadata": {},
"outputs": [],
"source": [
"users = list(data['user'].unique())\n",
"results = []\n",
"for user in users:\n",
" start_time = data[data[\"user\"]==user].index.min()\n",
" results.append(com.call_duration_total(\n",
" data[data[\"user\"]==user],\n",
" communication_column_name = \"call_duration\",\n",
" resample_args = {\"rule\":\"20T\",\"origin\":start_time}\n",
" ))\n",
"my_call_duration = pd.concat(results)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "413e1970",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" missed_duration_total \\\n",
"2019-08-09 07:11:34.539999962+03:00 0.0 \n",
"2019-08-09 07:31:34.539999962+03:00 0.0 \n",
"2019-08-09 07:51:34.539999962+03:00 0.0 \n",
"2019-08-09 08:11:34.539999962+03:00 0.0 \n",
"2019-08-09 08:31:34.539999962+03:00 0.0 \n",
"... ... \n",
"2019-08-09 06:51:34.539999962+03:00 0.0 \n",
"2020-01-09 01:55:16.996000051+02:00 0.0 \n",
"2020-01-09 02:15:16.996000051+02:00 0.0 \n",
"2020-01-09 02:35:16.996000051+02:00 0.0 \n",
"2020-01-09 02:55:16.996000051+02:00 0.0 \n",
"\n",
" incoming_duration_total \\\n",
"2019-08-09 07:11:34.539999962+03:00 0 \n",
"2019-08-09 07:31:34.539999962+03:00 1034 \n",
"2019-08-09 07:51:34.539999962+03:00 921 \n",
"2019-08-09 08:11:34.539999962+03:00 0 \n",
"2019-08-09 08:31:34.539999962+03:00 0 \n",
"... ... \n",
"2019-08-09 06:51:34.539999962+03:00 0 \n",
"2020-01-09 01:55:16.996000051+02:00 1079 \n",
"2020-01-09 02:15:16.996000051+02:00 1897 \n",
"2020-01-09 02:35:16.996000051+02:00 3398 \n",
"2020-01-09 02:55:16.996000051+02:00 269 \n",
"\n",
" outgoing_duration_total user \\\n",
"2019-08-09 07:11:34.539999962+03:00 1322.0 iGyXetHE3S8u \n",
"2019-08-09 07:31:34.539999962+03:00 959.0 iGyXetHE3S8u \n",
"2019-08-09 07:51:34.539999962+03:00 0.0 iGyXetHE3S8u \n",
"2019-08-09 08:11:34.539999962+03:00 0.0 iGyXetHE3S8u \n",
"2019-08-09 08:31:34.539999962+03:00 0.0 iGyXetHE3S8u \n",
"... ... ... \n",
"2019-08-09 06:51:34.539999962+03:00 0.0 iGyXetHE3S8u \n",
"2020-01-09 01:55:16.996000051+02:00 3448.0 jd9INuQ5BBlW \n",
"2020-01-09 02:15:16.996000051+02:00 3078.0 jd9INuQ5BBlW \n",
"2020-01-09 02:35:16.996000051+02:00 792.0 jd9INuQ5BBlW \n",
"2020-01-09 02:55:16.996000051+02:00 0.0 jd9INuQ5BBlW \n",
"\n",
" device \n",
"2019-08-09 07:11:34.539999962+03:00 Cq9vueHh3zVs \n",
"2019-08-09 07:31:34.539999962+03:00 Cq9vueHh3zVs \n",
"2019-08-09 07:51:34.539999962+03:00 Cq9vueHh3zVs \n",
"2019-08-09 08:11:34.539999962+03:00 Cq9vueHh3zVs \n",
"2019-08-09 08:31:34.539999962+03:00 Cq9vueHh3zVs \n",
"... ... \n",
"2019-08-09 06:51:34.539999962+03:00 Cq9vueHh3zVs \n",
"2020-01-09 01:55:16.996000051+02:00 3p83yASkOb_B \n",
"2020-01-09 02:15:16.996000051+02:00 3p83yASkOb_B \n",
"2020-01-09 02:35:16.996000051+02:00 3p83yASkOb_B \n",
"2020-01-09 02:55:16.996000051+02:00 3p83yASkOb_B \n",
"\n",
"[319 rows x 5 columns]"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"my_call_duration"
]
},
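{
"attachments": {},
"cell_type": "markdown",
"id": "5f2c9a1e",
"metadata": {},
"source": [
"To see the effect of `origin` in isolation, the following is a minimal `pandas`-only sketch that is not part of `niimpy`. The timestamps and durations are made-up values; the sketch compares the default bin alignment with bins anchored at the first datapoint.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy call durations (seconds) at irregular timestamps; all values are made up.\n",
"idx = pd.to_datetime(['2020-01-09 01:55:17', '2020-01-09 02:03:40', '2020-01-09 02:21:05'])\n",
"calls = pd.Series([100, 200, 300], index=idx)\n",
"\n",
"# Default origin: bin edges fall on multiples of 20 minutes counted from midnight.\n",
"fixed = calls.resample('20min').sum()\n",
"\n",
"# origin set to the first timestamp: the first bin starts at the first datapoint.\n",
"anchored = calls.resample('20min', origin=calls.index.min()).sum()\n",
"\n",
"print(fixed)\n",
"print(anchored)\n",
"```\n",
"\n",
"With the default origin the first bin is labelled 01:40:00, whereas with `origin` set to the first timestamp the first bin is labelled 01:55:17."
]
},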
{
"attachments": {},
"cell_type": "markdown",
"id": "13c86461",
"metadata": {},
"source": [
"### 4.2 Extract features using the wrapper\n",
"We can use `niimpy`'s ready-made wrapper to extract one or several features at the same time. The wrapper takes two inputs:\n",
"- (mandatory) a dataframe that complies with the minimum requirements (see '* TIP! Data requirements' above)\n",
"- (optional) an argument dictionary for the wrapper\n",
"\n",
"#### 4.2.1 The argument dictionary for wrapper (or how we specify the way the wrapper works)\n",
"The argument dictionary contains the arguments for each stand-alone function we would like to employ. Its keys are the feature functions we want to compute. Its values are argument dictionaries created for each stand-alone function we will employ. \n",
"Let's see some examples of wrapper dictionaries:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "f3c3492c",
"metadata": {},
"outputs": [],
"source": [
"wrapper_features1 = {com.call_duration_total:{\"communication_column_name\":\"call_duration\",\"resample_args\":{\"rule\":\"1D\"}},\n",
" com.call_count:{\"communication_column_name\":\"call_duration\",\"resample_args\":{\"rule\":\"1D\"}}}"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "1d4bb4e9",
"metadata": {},
"source": [
"- `wrapper_features1` will be used to compute two features, `call_duration_total` and `call_count`. For both features, we will use the data stored in the column `call_duration` of our dataframe, and the data will be binned in one-day periods. "
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "3ad697d4",
"metadata": {},
"outputs": [],
"source": [
"wrapper_features2 = {com.call_duration_mean:{\"communication_column_name\":\"random_name\",\"resample_args\":{\"rule\":\"1D\"}},\n",
" com.call_duration_median:{\"communication_column_name\":\"random_name\",\"resample_args\":{\"rule\":\"5H\",\"offset\":\"5min\"}}}"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "0bfe93a5",
"metadata": {},
"source": [
"- `wrapper_features2` will be used to compute two features, `call_duration_mean` and `call_duration_median`. For `call_duration_mean`, we will use the data stored in the column `random_name` of our dataframe, and the data will be binned in one-day periods. For `call_duration_median`, we will use the data stored in the same column `random_name`, and the data will be binned in 5-hour periods with a 5-minute offset. "
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "a8405422",
"metadata": {},
"outputs": [],
"source": [
"wrapper_features3 = {com.call_duration_total:{\"communication_column_name\":\"one_name\",\"resample_args\":{\"rule\":\"1D\",\"offset\":\"5min\"}},\n",
" com.call_count:{\"communication_column_name\":\"one_name\",\"resample_args\":{\"rule\":\"5H\"}},\n",
" com.call_duration_mean:{\"communication_column_name\":\"another_name\",\"resample_args\":{\"rule\":\"30T\",\"origin\":\"end_day\"}}}"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "957f2685",
"metadata": {},
"source": [
"- `wrapper_features3` will be used to compute three features, `call_duration_total`, `call_count`, and `call_duration_mean`. For `call_duration_total`, we will use the data stored in the column `one_name`, and the data will be binned in one-day periods with a 5-minute offset. For `call_count`, we will use the data stored in the column `one_name`, and the data will be binned in 5-hour periods. Finally, for `call_duration_mean`, we will use the data stored in the column `another_name`, and the data will be binned in 30-minute periods, with the origin of the bins set to the ceiling midnight of the last day.\n",
"\n",
"**Default values:** if no arguments are passed, `niimpy`'s default values are \"call_duration\" for `communication_column_name` and 30-minute aggregation bins. Moreover, if no argument dictionary is given, the wrapper computes all the available features. \n",
"\n",
"#### 4.2.2 Using the wrapper\n",
"Now that we understand how the wrapper is customized, it is time to compute our first communication feature using the wrapper. Suppose that we are interested in extracting the total call duration every 20 minutes. We will need `niimpy`'s `extract_features_comms` function, the data, and a dictionary to customize the function. Let's create the dictionary first:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "34846784",
"metadata": {},
"outputs": [],
"source": [
"wrapper_features1 = {com.call_duration_total:{\"communication_column_name\":\"call_duration\",\"resample_args\":{\"rule\":\"20T\"}}}"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "c8e626ea",
"metadata": {},
"source": [
"Now let's use the wrapper:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "33ebd988",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"computing ...\n"
]
},
{
"data": {
"text/plain": [
" device user missed_duration_total \\\n",
"2020-01-09 01:40:00+02:00 3p83yASkOb_B jd9INuQ5BBlW 0.0 \n",
"2020-01-09 02:00:00+02:00 3p83yASkOb_B jd9INuQ5BBlW 0.0 \n",
"2020-01-09 02:20:00+02:00 3p83yASkOb_B jd9INuQ5BBlW 0.0 \n",
"2020-01-09 02:40:00+02:00 3p83yASkOb_B jd9INuQ5BBlW 0.0 \n",
"2019-08-09 07:00:00+03:00 Cq9vueHh3zVs iGyXetHE3S8u 0.0 \n",
"\n",
" incoming_duration_total outgoing_duration_total \n",
"2020-01-09 01:40:00+02:00 0.0 1256.0 \n",
"2020-01-09 02:00:00+02:00 1079.0 2192.0 \n",
"2020-01-09 02:20:00+02:00 4650.0 3696.0 \n",
"2020-01-09 02:40:00+02:00 645.0 174.0 \n",
"2019-08-09 07:00:00+03:00 0.0 1322.0 "
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"results_wrapper = com.extract_features_comms(data, features=wrapper_features1)\n",
"results_wrapper.head(5)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "b1c6ba2b",
"metadata": {},
"source": [
"Our first attempt was successful. Now, let's try something more. Let's assume we want to compute `call_duration_total` and `call_count` in 20-minute bins."
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "e94e238a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"computing ...\n",
"computing ...\n"
]
},
{
"data": {
"text/plain": [
" device user missed_duration_total \\\n",
"2020-01-09 01:40:00+02:00 3p83yASkOb_B jd9INuQ5BBlW 0.0 \n",
"2020-01-09 02:00:00+02:00 3p83yASkOb_B jd9INuQ5BBlW 0.0 \n",
"2020-01-09 02:20:00+02:00 3p83yASkOb_B jd9INuQ5BBlW 0.0 \n",
"2020-01-09 02:40:00+02:00 3p83yASkOb_B jd9INuQ5BBlW 0.0 \n",
"2019-08-09 07:00:00+03:00 Cq9vueHh3zVs iGyXetHE3S8u 0.0 \n",
"\n",
" incoming_duration_total outgoing_duration_total \\\n",
"2020-01-09 01:40:00+02:00 0.0 1256.0 \n",
"2020-01-09 02:00:00+02:00 1079.0 2192.0 \n",
"2020-01-09 02:20:00+02:00 4650.0 3696.0 \n",
"2020-01-09 02:40:00+02:00 645.0 174.0 \n",
"2019-08-09 07:00:00+03:00 0.0 1322.0 \n",
"\n",
" outgoing_count incoming_count missed_count \n",
"2020-01-09 01:40:00+02:00 1.0 0.0 0.0 \n",
"2020-01-09 02:00:00+02:00 3.0 1.0 1.0 \n",
"2020-01-09 02:20:00+02:00 5.0 4.0 0.0 \n",
"2020-01-09 02:40:00+02:00 1.0 1.0 0.0 \n",
"2019-08-09 07:00:00+03:00 1.0 0.0 0.0 "
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wrapper_features2 = {com.call_duration_total:{\"communication_column_name\":\"call_duration\",\"resample_args\":{\"rule\":\"20T\"}},\n",
" com.call_count:{\"communication_column_name\":\"call_duration\",\"resample_args\":{\"rule\":\"20T\"}}}\n",
"results_wrapper = com.extract_features_comms(data, features=wrapper_features2)\n",
"results_wrapper.head(5)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "86fe7a58",
"metadata": {},
"source": [
"Great! Another successful attempt. We see from the results that more columns were added with the requested calculations. This is how the wrapper works when all features are computed with the same bins. Now, let's see how the wrapper performs when each function has different binning requirements. Let's assume we need to compute `call_duration_mean` every day and `call_duration_median` every 5 hours with an offset of 5 minutes."
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "e749f2f3",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"computing ...\n",
"computing ...\n"
]
},
{
"data": {
"text/plain": [
" device user missed_duration_mean \\\n",
"2020-01-09 00:00:00+02:00 3p83yASkOb_B jd9INuQ5BBlW 0.0 \n",
"2019-08-09 00:00:00+03:00 Cq9vueHh3zVs iGyXetHE3S8u 0.0 \n",
"2019-08-10 00:00:00+03:00 Cq9vueHh3zVs iGyXetHE3S8u 0.0 \n",
"2019-08-11 00:00:00+03:00 Cq9vueHh3zVs iGyXetHE3S8u 0.0 \n",
"2019-08-12 00:00:00+03:00 Cq9vueHh3zVs iGyXetHE3S8u 0.0 \n",
"\n",
" outgoing_duration_mean incoming_duration_mean \\\n",
"2020-01-09 00:00:00+02:00 731.8 949.000000 \n",
"2019-08-09 00:00:00+03:00 1140.5 651.666667 \n",
"2019-08-10 00:00:00+03:00 1363.0 1298.000000 \n",
"2019-08-11 00:00:00+03:00 0.0 0.000000 \n",
"2019-08-12 00:00:00+03:00 209.0 715.000000 \n",
"\n",
" incoming_duration_median outgoing_duration_median \\\n",
"2020-01-09 00:00:00+02:00 NaN NaN \n",
"2019-08-09 00:00:00+03:00 NaN NaN \n",
"2019-08-10 00:00:00+03:00 NaN NaN \n",
"2019-08-11 00:00:00+03:00 NaN NaN \n",
"2019-08-12 00:00:00+03:00 NaN NaN \n",
"\n",
" missed_duration_median \n",
"2020-01-09 00:00:00+02:00 NaN \n",
"2019-08-09 00:00:00+03:00 NaN \n",
"2019-08-10 00:00:00+03:00 NaN \n",
"2019-08-11 00:00:00+03:00 NaN \n",
"2019-08-12 00:00:00+03:00 NaN "
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wrapper_features3 = {com.call_duration_mean:{\"communication_column_name\":\"call_duration\",\"resample_args\":{\"rule\":\"1D\"}},\n",
" com.call_duration_median:{\"communication_column_name\":\"call_duration\",\"resample_args\":{\"rule\":\"5H\",\"offset\":\"5min\"}}}\n",
"results_wrapper = com.extract_features_comms(data, features=wrapper_features3)\n",
"results_wrapper.head(5)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "b2bf57dd",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" device user missed_duration_mean \\\n",
"2019-08-12 09:05:00+03:00 Cq9vueHh3zVs iGyXetHE3S8u NaN \n",
"2019-08-12 14:05:00+03:00 Cq9vueHh3zVs iGyXetHE3S8u NaN \n",
"2019-08-12 19:05:00+03:00 Cq9vueHh3zVs iGyXetHE3S8u NaN \n",
"2019-08-13 00:05:00+03:00 Cq9vueHh3zVs iGyXetHE3S8u NaN \n",
"2019-08-13 05:05:00+03:00 Cq9vueHh3zVs iGyXetHE3S8u NaN \n",
"\n",
" outgoing_duration_mean incoming_duration_mean \\\n",
"2019-08-12 09:05:00+03:00 NaN NaN \n",
"2019-08-12 14:05:00+03:00 NaN NaN \n",
"2019-08-12 19:05:00+03:00 NaN NaN \n",
"2019-08-13 00:05:00+03:00 NaN NaN \n",
"2019-08-13 05:05:00+03:00 NaN NaN \n",
"\n",
" incoming_duration_median outgoing_duration_median \\\n",
"2019-08-12 09:05:00+03:00 0.0 0.0 \n",
"2019-08-12 14:05:00+03:00 0.0 0.0 \n",
"2019-08-12 19:05:00+03:00 715.0 0.0 \n",
"2019-08-13 00:05:00+03:00 0.0 0.0 \n",
"2019-08-13 05:05:00+03:00 591.0 0.0 \n",
"\n",
" missed_duration_median \n",
"2019-08-12 09:05:00+03:00 0.0 \n",
"2019-08-12 14:05:00+03:00 0.0 \n",
"2019-08-12 19:05:00+03:00 0.0 \n",
"2019-08-13 00:05:00+03:00 0.0 \n",
"2019-08-13 05:05:00+03:00 0.0 "
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"results_wrapper.tail(5)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "086d72a0",
"metadata": {},
"source": [
"The output is once again a dataframe. In this case, two aggregations are shown. The first one is the daily aggregation computed for the `call_duration_mean` feature (head). The second one is the 5-hour aggregation with a 5-minute offset computed for the `call_duration_median` feature (tail). Note that because `call_duration_median` is not aggregated daily, its columns are NaN at the daily timestamps. Similarly, because `call_duration_mean` is not aggregated in 5-hour windows, its columns are NaN at the 5-hour timestamps. "
]
},
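{
"attachments": {},
"cell_type": "markdown",
"id": "9d41b7c3",
"metadata": {},
"source": [
"If the two aggregations are easier to handle separately, the combined dataframe can be split by dropping the padded rows. The sketch below uses a small made-up dataframe that only mimics the layout of the wrapper output (the user name and values are hypothetical). It assumes that NaN appears only as padding, so a bin where a feature is genuinely NaN would be dropped as well.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Toy stand-in for the wrapper output: one daily mean row and two 5-hour median\n",
"# rows stacked in one frame, each padded with NaN in the other feature's column.\n",
"combined = pd.DataFrame(\n",
"    {\n",
"        'user': ['u1', 'u1', 'u1'],\n",
"        'incoming_duration_mean': [949.0, np.nan, np.nan],\n",
"        'incoming_duration_median': [np.nan, 715.0, 591.0],\n",
"    },\n",
"    index=pd.to_datetime(['2020-01-09 00:00', '2020-01-09 00:05', '2020-01-09 05:05']),\n",
")\n",
"\n",
"# Keep only the rows where each feature was actually computed.\n",
"daily_mean = combined[['user', 'incoming_duration_mean']].dropna(\n",
"    subset=['incoming_duration_mean']\n",
")\n",
"five_hour_median = combined[['user', 'incoming_duration_median']].dropna(\n",
"    subset=['incoming_duration_median']\n",
")\n",
"\n",
"print(daily_mean)\n",
"print(five_hour_median)\n",
"```\n",
"\n",
"Each resulting dataframe then contains a single aggregation level and can be analyzed on its own."
]
},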
{
"attachments": {},
"cell_type": "markdown",
"id": "db978353",
"metadata": {},
"source": [
"#### 4.2.3 Wrapper and its default option\n",
"The default option will compute all features in 30-minute aggregation windows. To use the `extract_features_comms` function with its default options, simply call the function without an argument dictionary (`features=None`). "
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "c83dd3e0",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"computing ...\n",
"computing ...\n",
"computing ...\n",
"computing ...\n",
"computing ...\n",
"computing ...\n",
"computing ...\n",
"computing ...\n",
"computing ...\n",
"computing ...\n"
]
}
],
"source": [
"default = com.extract_features_comms(data, features=None)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "7f1a0c6d",
"metadata": {},
"source": [
"The function prints a line for each feature it computes, so you can track its progress. Now let's have a look at the outputs:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "8647fb31",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" device user missed_duration_total \\\n",
"2020-01-09 01:30:00+02:00 3p83yASkOb_B jd9INuQ5BBlW 0.0 \n",
"2020-01-09 02:00:00+02:00 3p83yASkOb_B jd9INuQ5BBlW 0.0 \n",
"2020-01-09 02:30:00+02:00 3p83yASkOb_B jd9INuQ5BBlW 0.0 \n",
"2019-08-09 07:00:00+03:00 Cq9vueHh3zVs iGyXetHE3S8u 0.0 \n",
"2019-08-09 07:30:00+03:00 Cq9vueHh3zVs iGyXetHE3S8u 0.0 \n",
"\n",
" incoming_duration_total outgoing_duration_total \\\n",
"2020-01-09 01:30:00+02:00 0.0 1256.0 \n",
"2020-01-09 02:00:00+02:00 2976.0 5270.0 \n",
"2020-01-09 02:30:00+02:00 3398.0 792.0 \n",
"2019-08-09 07:00:00+03:00 0.0 1322.0 \n",
"2019-08-09 07:30:00+03:00 1824.0 959.0 \n",
"\n",
" missed_duration_mean outgoing_duration_mean \\\n",
"2020-01-09 01:30:00+02:00 0.0 1256.000000 \n",
"2020-01-09 02:00:00+02:00 0.0 752.857143 \n",
"2020-01-09 02:30:00+02:00 0.0 396.000000 \n",
"2019-08-09 07:00:00+03:00 0.0 1322.000000 \n",
"2019-08-09 07:30:00+03:00 0.0 959.000000 \n",
"\n",
" incoming_duration_mean incoming_duration_median \\\n",
"2020-01-09 01:30:00+02:00 0.000000 0.0 \n",
"2020-01-09 02:00:00+02:00 992.000000 1079.0 \n",
"2020-01-09 02:30:00+02:00 1132.666667 1264.0 \n",
"2019-08-09 07:00:00+03:00 0.000000 0.0 \n",
"2019-08-09 07:30:00+03:00 912.000000 912.0 \n",
"\n",
" outgoing_duration_median missed_duration_median \\\n",
"2020-01-09 01:30:00+02:00 1256.0 0.0 \n",
"2020-01-09 02:00:00+02:00 851.0 0.0 \n",
"2020-01-09 02:30:00+02:00 396.0 0.0 \n",
"2019-08-09 07:00:00+03:00 1322.0 0.0 \n",
"2019-08-09 07:30:00+03:00 959.0 0.0 \n",
"\n",
" missed_duration_std outgoing_duration_std \\\n",
"2020-01-09 01:30:00+02:00 0.0 0.000000 \n",
"2020-01-09 02:00:00+02:00 0.0 443.087060 \n",
"2020-01-09 02:30:00+02:00 0.0 313.955411 \n",
"2019-08-09 07:00:00+03:00 0.0 0.000000 \n",
"2019-08-09 07:30:00+03:00 0.0 0.000000 \n",
"\n",
" incoming_duration_std outgoing_count \\\n",
"2020-01-09 01:30:00+02:00 0.000000 1.0 \n",
"2020-01-09 02:00:00+02:00 545.726122 7.0 \n",
"2020-01-09 02:30:00+02:00 437.058730 2.0 \n",
"2019-08-09 07:00:00+03:00 0.000000 1.0 \n",
"2019-08-09 07:30:00+03:00 172.534055 1.0 \n",
"\n",
" incoming_count missed_count \\\n",
"2020-01-09 01:30:00+02:00 0.0 0.0 \n",
"2020-01-09 02:00:00+02:00 3.0 1.0 \n",
"2020-01-09 02:30:00+02:00 3.0 0.0 \n",
"2019-08-09 07:00:00+03:00 0.0 0.0 \n",
"2019-08-09 07:30:00+03:00 2.0 1.0 \n",
"\n",
" outgoing_incoming_ratio distribution \n",
"2020-01-09 01:30:00+02:00 inf NaN \n",
"2020-01-09 02:00:00+02:00 2.333333 0.888889 \n",
"2020-01-09 02:30:00+02:00 0.666667 NaN \n",
"2019-08-09 07:00:00+03:00 inf 0.833333 \n",
"2019-08-09 07:30:00+03:00 0.500000 NaN "
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"default.head()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "7130b5ac",
"metadata": {},
"source": [
"### 4.3 SMS computations"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "f4479b76",
"metadata": {},
"source": [
"`niimpy` includes one function to count outgoing and incoming SMS messages. This function is not automatically called by `extract_features_comms`, but it can be used as a standalone function. Let's see a quick example where we load the SMS data and preprocess it. "
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "1c784afa",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" user device time \\\n",
"2020-01-09 02:34:46.644999981+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578530e+09 \n",
"2020-01-09 02:34:58.802999973+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578530e+09 \n",
"2020-01-09 02:35:37.611000061+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578530e+09 \n",
"2020-01-09 02:55:40.640000105+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578531e+09 \n",
"2020-01-09 02:55:50.914000034+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578531e+09 \n",
"\n",
" message_type \\\n",
"2020-01-09 02:34:46.644999981+02:00 incoming \n",
"2020-01-09 02:34:58.802999973+02:00 outgoing \n",
"2020-01-09 02:35:37.611000061+02:00 outgoing \n",
"2020-01-09 02:55:40.640000105+02:00 outgoing \n",
"2020-01-09 02:55:50.914000034+02:00 incoming \n",
"\n",
" datetime \n",
"2020-01-09 02:34:46.644999981+02:00 2020-01-09 02:34:46.644999981+02:00 \n",
"2020-01-09 02:34:58.802999973+02:00 2020-01-09 02:34:58.802999973+02:00 \n",
"2020-01-09 02:35:37.611000061+02:00 2020-01-09 02:35:37.611000061+02:00 \n",
"2020-01-09 02:55:40.640000105+02:00 2020-01-09 02:55:40.640000105+02:00 \n",
"2020-01-09 02:55:50.914000034+02:00 2020-01-09 02:55:50.914000034+02:00 "
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = niimpy.read_csv(config.MULTIUSER_AWARE_MESSAGES_PATH, tz='Europe/Helsinki')\n",
"data.head()"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "deb7bcd5",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" device outgoing_count incoming_count \\\n",
"2020-01-09 02:30:00+02:00 3p83yASkOb_B 5 5.0 \n",
"2019-08-13 08:30:00+03:00 Cq9vueHh3zVs 1 1.0 \n",
"2019-08-13 09:00:00+03:00 Cq9vueHh3zVs 0 0.0 \n",
"2019-08-13 09:30:00+03:00 Cq9vueHh3zVs 2 1.0 \n",
"2019-08-13 10:00:00+03:00 Cq9vueHh3zVs 0 0.0 \n",
"... ... ... ... \n",
"2020-01-09 12:00:00+02:00 OWd1Uau8POix 0 0.0 \n",
"2020-01-09 12:30:00+02:00 OWd1Uau8POix 0 3.0 \n",
"2020-01-09 13:00:00+02:00 OWd1Uau8POix 0 0.0 \n",
"2020-01-09 13:30:00+02:00 OWd1Uau8POix 0 0.0 \n",
"2020-01-09 14:00:00+02:00 OWd1Uau8POix 2 6.0 \n",
"\n",
" user \n",
"2020-01-09 02:30:00+02:00 jd9INuQ5BBlW \n",
"2019-08-13 08:30:00+03:00 iGyXetHE3S8u \n",
"2019-08-13 09:00:00+03:00 iGyXetHE3S8u \n",
"2019-08-13 09:30:00+03:00 iGyXetHE3S8u \n",
"2019-08-13 10:00:00+03:00 iGyXetHE3S8u \n",
"... ... \n",
"2020-01-09 12:00:00+02:00 jd9INuQ5BBlW \n",
"2020-01-09 12:30:00+02:00 jd9INuQ5BBlW \n",
"2020-01-09 13:00:00+02:00 jd9INuQ5BBlW \n",
"2020-01-09 13:30:00+02:00 jd9INuQ5BBlW \n",
"2020-01-09 14:00:00+02:00 jd9INuQ5BBlW \n",
"\n",
"[114 rows x 4 columns]"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sms = com.message_count(data, config={\"communication_column_name\": \"message_type\", \"call_type_column\": \"message_type\"})\n",
"sms"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "34dad998",
"metadata": {},
"source": [
"As with the call functions, we need to define the `config` dictionary. If we leave it empty, all data are aggregated in 30-minute bins. The function also differentiates between incoming and outgoing messages. Let's quickly summarize the data requirements for SMS.\n",
"\n",
"## * TIP! Data format requirements for SMS (special case)\n",
"\n",
"Data can take other shapes and formats. However, the `niimpy` data scheme requires it to be in a certain shape. This means the dataframe needs to have at least the following characteristics:\n",
"1. One row per message. Each row should store information about one message only\n",
"2. Each row's index should be a timestamp\n",
"3. There should be at least the following index and columns: \n",
"   - index: date and time when the event happened (timestamp)\n",
"   - user: stores the name of the user whose data is analyzed. Each user should have a unique name or hash (i.e. one hash for each unique user)\n",
"   - message_type: determines if the message was sent (outgoing) or received (incoming)\n",
"4. Columns additional to those listed in item 3 are allowed\n",
"5. The names of the columns do not need to be exactly \"user\" and \"message_type\"; if they differ, they must be provided as parameters"
]
},
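{
"attachments": {},
"cell_type": "markdown",
"id": "sms-format-sketch",
"metadata": {},
"source": [
"The requirements above can be illustrated with a small hand-made dataframe. This is only a sketch under stated assumptions: the user name, the timestamps, and the variable names `sms_toy` and `counts_by_type` are hypothetical, and only plain `pandas` is used, not `niimpy` itself.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data: one row per message, indexed by timestamp.\n",
"index = pd.to_datetime(\n",
"    [\n",
"        '2020-01-09 12:05:00',\n",
"        '2020-01-09 12:20:00',\n",
"        '2020-01-09 12:40:00',\n",
"        '2020-01-09 13:10:00',\n",
"    ]\n",
").tz_localize('Europe/Helsinki')\n",
"\n",
"sms_toy = pd.DataFrame(\n",
"    {\n",
"        'user': ['user_a', 'user_a', 'user_a', 'user_a'],\n",
"        'message_type': ['incoming', 'outgoing', 'incoming', 'incoming'],\n",
"    },\n",
"    index=index,\n",
")\n",
"\n",
"# Checks that mirror the requirements listed above.\n",
"assert isinstance(sms_toy.index, pd.DatetimeIndex)\n",
"assert {'user', 'message_type'} <= set(sms_toy.columns)\n",
"assert set(sms_toy['message_type']) <= {'incoming', 'outgoing'}\n",
"\n",
"# Number of messages per direction in the toy data.\n",
"counts_by_type = sms_toy.groupby('message_type').size()\n",
"print(counts_by_type)\n",
"```\n",
"\n",
"A dataframe that passes these checks has the shape described in the list; extra columns would not affect them."
]
},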
{
"attachments": {},
"cell_type": "markdown",
"id": "cd15df86",
"metadata": {},
"source": [
"## 5. Implementing our own features"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "ad5cded8",
"metadata": {},
"source": [
"If none of the provided functions suits our needs, we can easily implement our own customized features. To do so, we need to define a function that accepts a dataframe and returns a dataframe. In the example below, the returned dataframe follows the same layout as the outputs of the built-in functions shown above: a timestamp index, with `user` and `device` stored as columns.\n",
"Let's assume we need a new function that counts all calls, independent of their direction (outgoing, incoming, etc.). Let's first define the function:"
]
},
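{
"cell_type": "code",
"execution_count": 25,
"id": "1012ee86",
"metadata": {},
"outputs": [],
"source": [
"def call_count_all(df, communication_column_name=\"call_duration\", resample_args=None):\n",
"    \"\"\"Count all calls per user and device in each time bin, regardless of direction.\"\"\"\n",
"    if resample_args is None:\n",
"        resample_args = {\"rule\": \"30T\"}  # default: 30-minute bins\n",
"    if len(df) > 0:\n",
"        # Count the rows (one row per call) that fall into each bin\n",
"        result = df.groupby([\"user\", \"device\"])[communication_column_name].resample(**resample_args).count()\n",
"        result.rename(\"call_count_all\", inplace=True)\n",
"        result = result.to_frame()\n",
"        # Move user and device from the index back into columns\n",
"        result = result.reset_index([\"user\", \"device\"])\n",
"        return result\n",
"\n",
"    return None"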
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "1ff7c17e",
"metadata": {},
"source": [
"We can now call the new function on its own or through the `extract_features_comms` wrapper. Calling it on its own works like any other Python function call, so we will not show it here. Instead, we will show how to integrate the new function into the wrapper. Let's read the data again and assume we want the wrapper's default behavior."
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "4c55a72d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"computing ...\n"
]
}
],
"source": [
"data = niimpy.read_csv(config.MULTIUSER_AWARE_CALLS_PATH, tz='Europe/Helsinki')\n",
"customized_features = com.extract_features_comms(data, features={call_count_all: {}})"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "84735297",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>device</th>\n",
"      <th>user</th>\n",
"      <th>call_count_all</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>2019-08-08 22:30:00+03:00</th>\n",
"      <td>Cq9vueHh3zVs</td>\n",
"      <td>iGyXetHE3S8u</td>\n",
"      <td>5</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2019-08-08 23:00:00+03:00</th>\n",
"      <td>Cq9vueHh3zVs</td>\n",
"      <td>iGyXetHE3S8u</td>\n",
"      <td>0</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2019-08-08 23:30:00+03:00</th>\n",
"      <td>Cq9vueHh3zVs</td>\n",
"      <td>iGyXetHE3S8u</td>\n",
"      <td>0</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2019-08-09 00:00:00+03:00</th>\n",
"      <td>Cq9vueHh3zVs</td>\n",
"      <td>iGyXetHE3S8u</td>\n",
"      <td>0</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2019-08-09 00:30:00+03:00</th>\n",
"      <td>Cq9vueHh3zVs</td>\n",
"      <td>iGyXetHE3S8u</td>\n",
"      <td>0</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" device user call_count_all\n",
"2019-08-08 22:30:00+03:00 Cq9vueHh3zVs iGyXetHE3S8u 5\n",
"2019-08-08 23:00:00+03:00 Cq9vueHh3zVs iGyXetHE3S8u 0\n",
"2019-08-08 23:30:00+03:00 Cq9vueHh3zVs iGyXetHE3S8u 0\n",
"2019-08-09 00:00:00+03:00 Cq9vueHh3zVs iGyXetHE3S8u 0\n",
"2019-08-09 00:30:00+03:00 Cq9vueHh3zVs iGyXetHE3S8u 0"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"customized_features.head()"
]
}
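,
{
"attachments": {},
"cell_type": "markdown",
"id": "custom-feature-sketch",
"metadata": {},
"source": [
"The groupby and resample pattern used in `call_count_all` can also be tried outside the wrapper. The following is a minimal sketch on a hypothetical toy call log: the user and device names, the durations, and the variable names `calls_toy` and `counts` are made up for illustration. It uses plain `pandas` only, with the `30min` alias, which is equivalent to `30T`.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy call log: one row per call, indexed by timestamp.\n",
"index = pd.to_datetime(\n",
"    ['2019-08-08 22:32:00', '2019-08-08 22:41:00', '2019-08-08 23:35:00']\n",
").tz_localize('Europe/Helsinki')\n",
"calls_toy = pd.DataFrame(\n",
"    {\n",
"        'user': ['user_a'] * 3,\n",
"        'device': ['device_a'] * 3,\n",
"        'call_duration': [60, 125, 30],\n",
"    },\n",
"    index=index,\n",
")\n",
"\n",
"# Count rows per user and device in 30-minute bins, then move the\n",
"# grouping keys back into columns so only the timestamp stays in the index.\n",
"counts = (\n",
"    calls_toy.groupby(['user', 'device'])['call_duration']\n",
"    .resample('30min')\n",
"    .count()\n",
"    .rename('call_count_all')\n",
"    .to_frame()\n",
"    .reset_index(['user', 'device'])\n",
")\n",
"print(counts)\n",
"```\n",
"\n",
"Bins without any call appear with a count of zero, which matches the zero rows in the wrapper output above."
]
}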
],
"metadata": {
"kernelspec": {
"display_name": "niimpy",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}