Author: | Louis Plissonneau, Parikshit Juluri, Mickaël Meulle |
---|---|
Version: | 0.1.0 |
Copyright: | GPLv2 |
Pytomo is a Python-based tomographic tool for analyzing YouTube video download rates. The analysis starts from an initial list of videos. For each video in this list, Pytomo first finds the IP addresses of the cache servers on which the video is located. Each cache server is then pinged to measure the RTT. Finally, the video is downloaded for a limited amount of time to compute the download statistics.
The setup file used to create the Python package.
This file.
This folder contains the pytomo package. The contents of it are listed below.
The top-most module that is used to run the Pytomo tool.
Functions
- compute_stats(url)
- Returns a list of the statistics related to the url. The list contains (url, cache_url, current_stats), where current_stats is a list containing [Ping_times, download_statistics, DNS resolver used].
- format_stats(stats)
- Formats the stats obtained from compute_stats() so that they can be inserted into the sqlite3 database. The stats are converted into a tuple. The argument to this function is the list returned by compute_stats().
- check_out_files(file_pattern, directory, timestamp):
- Returns the full path of the file used for the output. It checks whether the path exists; if not, the file is created in that path if possible, otherwise in the default user directory.
- do_crawl(result_stream=sys.stdout, timestamp=None):
- Crawls the urls given by url_file.txt (present in the package). The crawl is performed up to MAX_ROUNDS or MAX_VISITED_URLS.
- main(argv=None)
- This is the program wrapper for the start_pytomo module. It is mainly used to set up and initialize the logging and other startup parameters.
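The bounded crawl described for do_crawl can be illustrated with a short sketch. It assumes a round-by-round structure in which each round visits the related videos found in the previous one; `crawl`, `get_related` and the default limits are hypothetical names chosen for illustration, not Pytomo's actual code.

```python
def crawl(seed_urls, get_related, max_rounds=2, max_visited_urls=5):
    """Visit URLs round by round until either limit is reached.

    get_related is a hypothetical callable that returns the related URLs
    of a given URL.
    """
    visited = []
    seen = set()
    current = list(seed_urls)
    for _ in range(max_rounds):
        next_round = []
        for url in current:
            # Stop as soon as the visit budget is exhausted.
            if len(visited) >= max_visited_urls:
                return visited
            if url in seen:
                continue
            seen.add(url)
            visited.append(url)
            next_round.extend(get_related(url))
        current = next_round
    return visited
```

Whichever limit is hit first ends the crawl, which matches the "up to MAX_ROUNDS or MAX_VISITED_URLS" wording above.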
File containing the various parameters and constants that are used for the analysis. The following parameters determine the nature of the crawl.
INPUT_URL_FILE = The file that has the initial list of urls.
MAX_ROUNDS = Maximum number of crawl rounds to perform.
MAX_CRAWLED_URLS = Max number of urls to be visited.
MAX_PER_URL = Max number of related videos to be selected from each url.
MAX_PER_PAGE = Max number of related videos to be considered for selection from each page
address. This resolver will be used to get the IP address of the youtube cache
PING_PACKETS = Number of ping packets to be sent.
DOWNLOAD_TIME = The duration for which the video must be downloaded.
buffer.
MIN_PLAYOUT_BUFFER_SIZE = The size of the buffer for the video stream.
RESULT_DIR = The directory to store the text results.
RESULT_FILE = The file to store the text results.
DATABASE_DIR = The directory to store the result database.
DATABASE = The name of the result database
TABLE = The name of the result table.
LOG_DIR = The directory to store the log files.
LOG_FILE = The file to store the logs.
ERROR and CRITICAL)
Module to download a YouTube video for a limited amount of time and calculate the statistics needed for the analysis. It has the FileDownloader class and the YouTube InfoExtractor class. The following functions defined in this module are used to get the statistics of the download.
Module to retrieve the IP address of a URL from a set of nameservers (the default nameservers and the ones provided in the config_pytomo file as EXTRA_NAME_SERVERS).
Functions
- get_default_name_servers():
- Returns a list of IP addresses of default name servers.
- get_ip_addresses(url):
- Returns a list of tuples, each containing an IP address and the DNS resolver used to obtain it.
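The shape of the value returned by get_ip_addresses can be sketched as follows. This is an illustration only: it does not use the DNS Python Package, and it assumes each resolver is represented by a hypothetical lookup callable, so the signature differs from the real function.

```python
def get_ip_addresses(hostname, resolvers):
    """Return a list of (ip_address, resolver_name) tuples.

    resolvers maps a resolver name to a hypothetical lookup callable that
    takes a hostname and returns a list of IP address strings.
    """
    results = []
    for name, lookup in resolvers.items():
        try:
            addresses = lookup(hostname)
        except OSError:
            # A failing resolver should not abort the other lookups.
            continue
        for ip in addresses:
            results.append((ip, name))
    return results
```

For the system resolver, a lookup callable could be as simple as `lambda h: socket.gethostbyname_ex(h)[2]`. Keeping the resolver name next to each address makes it possible to compare which cache server each resolver points to.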
Module to obtain the RTT values of a ping.
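One common way to obtain RTT values is to run the system ping command and parse its summary line. The sketch below shows only the parsing step and assumes the Linux (`rtt min/avg/max/mdev`) or BSD (`round-trip min/avg/max/stddev`) summary format; whether Pytomo parses ping output this way is an assumption, and `parse_ping_rtt` is a hypothetical name.

```python
import re

# Matches the summary line printed by common ping implementations.
RTT_SUMMARY = re.compile(
    r"(?:rtt|round-trip) min/avg/max/\w+ = ([\d.]+)/([\d.]+)/([\d.]+)/"
)


def parse_ping_rtt(output):
    """Return (min, avg, max) RTT in milliseconds, or None if not found."""
    match = RTT_SUMMARY.search(output)
    if match is None:
        return None
    return tuple(float(value) for value in match.groups())
```

Returning None when no summary line is present lets the caller distinguish an unreachable server from a measured RTT.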
Module to retrieve the related videos from a file with a list of YouTube links and to store it for next round of the crawl.
Function to get the most popular YouTube videos according to the time frame.
Module that creates and manages the database for the Pytomo results. It has a PytomoDatabase class. The columns of the database are listed in the docstring of the module.
Functions
The PytomoDatabase class has the following functions:
- create_pytomo_table(self, table=config_pytomo.TABLE_TIMESTAMP):
- Function to create a table.
- insert_record(self, row):
- Function to insert a record into the database.
- fetch_all(self):
- Function to print all the records of the table.
- close_handle(self):
- Closes the connection to the database.
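The create/insert/fetch/close cycle listed above maps directly onto the standard sqlite3 module. The class below is a minimal stand-in, not the real PytomoDatabase: the class name, the default table name, the reduced set of four columns and their types are assumptions, and fetch_all returns the rows instead of printing them.

```python
import sqlite3


class ResultDatabase:
    """Minimal stand-in for a results database (hypothetical, simplified)."""

    def __init__(self, path=":memory:", table="pytomo_results"):
        self.table = table
        self.conn = sqlite3.connect(path)

    def create_table(self):
        # Table names cannot be bound as SQL parameters, so the name is
        # interpolated; it must come from trusted configuration.
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS {} (Timestamp TEXT, Url TEXT, "
            "CacheUrl TEXT, DownloadBytes INTEGER)".format(self.table)
        )

    def insert_record(self, row):
        # Values are bound with placeholders rather than string formatting.
        self.conn.execute(
            "INSERT INTO {} VALUES (?, ?, ?, ?)".format(self.table), row
        )
        self.conn.commit()

    def fetch_all(self):
        query = "SELECT * FROM {}".format(self.table)
        return self.conn.execute(query).fetchall()

    def close_handle(self):
        self.conn.close()
```

The row passed to insert_record is a tuple, which is why the stats are flattened into a tuple before insertion.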
This is a stripped-down version of the Kaa Metadata Python Package (Version: '0.7.7'). The package has been modified to be used with the lib_youtube_download.py module so that we can obtain the metadata of the video. The main modification was to make it independent of kaa_base.
This is the DNS Python Package (Version : 1.9.2) that is used to obtain the nameservers for the machine and also to send DNS queries to the nameservers to obtain the IP addresses of the YouTube cache servers.
This folder contains the log files generated by the logger. These files contain the log details generated during the crawl run.
This folder contains the database files used to store the results. The columns of the database are as follows:
Timestamp - A timestamp indicating the time of inserting the row.
Example: YouTube, Megavideo.
Url - The url of the webpage.
CacheUrl - The Url of the cache server hosting the video.
video is downloaded.
of the cache server. Example Google DNS, Local DNS
limited download time.)
VideoDuration - The actual duration of the complete video.
VideoLength - The length(in bytes) of the complete video.
EncodingRate - The encoding rate of the video.
DownloadBytes - The length of the video sample(in bytes).
download.
BufferingDuration - Accumulated time spent in buffering state.
PlaybackDuration - Accumulated time spent in playing state.
BufferDurationAtEnd - The buffer length at the end of download.
address of the cache server.
address of the cache server.
address of the cache server.
This folder contains the result files. The results in these files are listed in text format. Each result is a list containing [Video url, Cache url, IP address of cache server, Ping RTT times to the cache server, Download stats, Name_IP_address of the DNS resolver], where Download stats = [download_time, data duration, data_length, video encoding_rate, size of video in bytes, Nb. of interruptions, accumulated_buffer size, accumulated_playback, current_remaining buffer].
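When post-processing these results, it can help to give the nine Download stats fields names. The sketch below follows the field order listed above; the `DownloadStats` and `average_rate` names are hypothetical, and the units (seconds for download_time, bytes for data_length) are assumptions.

```python
from collections import namedtuple

# Field order follows the Download stats list in the result files.
DownloadStats = namedtuple(
    "DownloadStats",
    [
        "download_time",
        "data_duration",
        "data_length",
        "encoding_rate",
        "video_size_bytes",
        "interruptions",
        "accumulated_buffer",
        "accumulated_playback",
        "remaining_buffer",
    ],
)


def average_rate(stats):
    """Average download rate in bytes per second (assumed units).

    Returns None when no download time was recorded.
    """
    if stats.download_time <= 0:
        return None
    return stats.data_length / stats.download_time
```

A record read from a result file could then be wrapped with `DownloadStats(*download_stats_list)` before computing derived values.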