Module to plot cdf from data or file. Can be called directly.
Bases: object
Hold the figure and its default properties
Put correct styles in the axes lines. Should be launched when all lines are plotted. Optimised for up to 8 lines in the plot.
Plot the cdf of a data array. Wrapper to call the plot method of axes.
Plot a bin plot of a dictionary
Plot the cdf of a list of names and data arrays
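As a minimal sketch of what plotting a cdf involves (this is not the module's actual code), the points of an empirical cdf can be computed from a data array as below. The name `empirical_cdf` is hypothetical; the returned lists could then be passed to the plot method of the axes.

```python
def empirical_cdf(data):
    """Return the sorted values and their cumulative probabilities.

    Hypothetical helper, not part of the documented module.
    """
    xs = sorted(data)
    count = len(xs)
    # The i-th smallest value (1-based) has cumulative probability i / count.
    ys = [(index + 1) / float(count) for index in range(count)]
    return xs, ys
```

For the data `[3, 1, 2, 4]` this yields the x values `[1, 2, 3, 4]` and the probabilities `[0.25, 0.5, 0.75, 1.0]`.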
The config file for the pytomo setup. Lines starting with # are comments.
frame (adapted from lib_youtube_api); - function to retrieve the related videos from the Dailymotion api in a list of links.
Author: Ana Oprea Date: 04.09.2012 - modified 17.09.2012 for related links
Usage: To use the functions provided in this module independently, first place yourself just above the pytomo folder. Then:
>>> import pytomo.start_pytomo as start_pytomo
>>> TIMESTAMP = 'test_timestamp'
>>> start_pytomo.configure_log_file(TIMESTAMP)
>>> import pytomo.lib_dailymotion_api as lib_dailymotion_api
>>> url = 'http://www.dailymotion.com/video/xkqa0p'
>>> time_f = 'today' # choose from 'today', 'week', 'month' or 'all_time'
>>> max_results = 20
>>> lib_dailymotion_api.get_popular_links(time_f, max_results)
>>> max_per_page = 25
>>> max_per_url = 10
>>> lib_dailymotion_api.get_dailymotion_links(url, max_per_page)
>>> lib_dailymotion_api.get_related_urls(url, max_per_page, max_per_url)
Parse and return a list of the ids of the related videos from the Dailymotion api:
>>> get_all_related_ids('http://www.dailymotion.com/video/xv7ent', 20)
['xv8xoj', 'xvajbn', 'xvbhdi', 'xvam4y', 'xv8x1t', 'xv8sn2', 'xvbx1x',
'xv7gkf', 'xv5cnr', 'xvajng', 'xv9ir4', 'xvakfr', 'xv9hjr', 'xvbwax',
'xv8ttw', 'xv75ou', 'xv587j', 'xvakwj', 'xv8xqp', 'xv9ihm']
Return a set of only Dailymotion links from url
Return the id of a Dailymotion url.
>>> url = 'http://www.dailymotion.com/video/xkqa0p'
>>> get_id(url)
'xkqa0p'
>>> url = 'http://www.dailymotion.com/video/xkqa0p_angry-birds-theme-covered-by-pomplamoose_music'
>>> get_id(url)
'xkqa0p'
>>> url = 'http://www.dailymotion.com/video/xkqa0p?background=493D27&foreground=E8D9AC&highlight=FFFFF0&autoPlay=1'
>>> get_id(url)
'xkqa0p'
>>> url = 'http://vid.ak.dmcdn.net/video/986/034/42430689_mp4_h264_aac.mp4?primaryToken=1343398942_d77027d09aac0c5d5de74d5428fb9e5b'
>>> get_id(url)
'42430689'
>>> url = 'http://www.dailymotion.com/video/xscdm4_le-losc-au-pays-basque_sport?no_track=1'
>>> get_id(url)
'xscdm4'
>>> url = 'http://vid.ec.dmcdn.net/cdn/H264-512x384/video/xmcyww.mp4?77838fedd64fa52abe6a11b3bdbb4e62f4387ebf7cbce2147ea4becc5eee5c418aaa6598bb98a61fc95a02997247e59bfb0dcd58cdf05c1601ded04f75ae357b225da725baad5e97ea6cce6d6a12e17d1c01'
>>> get_id(url)
'xmcyww'
>>> url = 'http://proxy-60.dailymotion.com/video/246/655/37556642_mp4_h264_aac.mp4?auth=1343399602-4098-bdkyfgul-eb00ad223e1964e40b327d75367b273b'
>>> get_id(url)
'37556642'
>>> url = 'http://docs.python.org/tutorial/inputoutput.html'
>>> get_id(url)
'inputoutput.html'
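The documented results of get_id are consistent with a simple rule: take the last path segment of the url, keep what precedes the first underscore, and drop a trailing `.mp4`. The sketch below implements that rule in Python 3. It is an assumption inferred from the doctests, not the module's actual implementation, and the name `get_id_sketch` is hypothetical.

```python
from urllib.parse import urlsplit


def get_id_sketch(url):
    """Extract a video id from a url (hypothetical re-implementation)."""
    # urlsplit removes the query string, so only the path is inspected.
    last_segment = urlsplit(url).path.rsplit('/', 1)[-1]
    # Ids are followed by '_<title>' or '_mp4_h264_aac.mp4' in the examples.
    video_id = last_segment.split('_', 1)[0]
    # CDN urls may end with a bare '<id>.mp4'.
    if video_id.endswith('.mp4'):
        video_id = video_id[:-len('.mp4')]
    return video_id
```

With this rule, a url outside Dailymotion such as the Python tutorial page simply yields its last path segment, which matches the last documented example.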
Returns the most popular Dailymotion links for France. The country should be set as a parameter in start_pytomo if the user is to specify it. The number of videos returned is given as Total_pages. (The results are returned in no particular order.) Returns: A set of only Dailymotion links from url.
Return a set of max_links randomly chosen related urls
Returns the time frame in the form accepted by youtube_api
>>> get_time_frame('today')
'popular-today'
>>> get_time_frame('week')
'popular-week'
>>> get_time_frame('month')
'popular-month'
>>> get_time_frame('all_time')
'popular'
Return the complete link of a Dailymotion url.
>>> url_id = 'x1y0ap'
>>> set_id(url_id)
'http://www.dailymotion.com/video/x1y0ap'
Adapted from lib_youtube_download.py for Dailymotion. Module to download a Dailymotion video for a limited amount of time and calculate the data downloaded within that time.
- Usage:
- This module provides two classes: FileDownloader class and the InfoExtractor class. This module is not meant to be called directly.
Bases: pytomo.lib_general_download.InfoExtractor
Information Extractor for Dailymotion
Returns True if the URL is suitable for this IE, else False.
>>> die = DailymotionIE(InfoExtractor)
>>> die.suitable('http://www.dailymotion.com/video/xscdm4_le-losc-au-pays-basque_sport?no_track=1')
True
>>> die.suitable('http://www.dailymotion.com')
False
>>> die.suitable('http://vid.ec.dmcdn.net/cdn/H264-512x384/video/xscdm4.mp4?77838fedd64fa52abe6a11b3bdbb4e62f4387ebf7cbce2147ea4becc5fe6574d7c3ec5681aa355d923bdca173f151658eefcd8763fc08a9380a7e2f26cbe49b67e583118fb414738b9d3e9db8882d33200be&ec_prebuf=20&ec_rate=68')
True
Module for the SQLite interface to the pytomo database. Usage (to be run interactively above the pytomo directory):
import pytomo.start_pytomo as start_pytomo
start_pytomo.configure_log_file('doc_test')
import pytomo.lib_database as lib_database
import time
import datetime
# to make sure a new file is created for every run.
timestamp = time.strftime("%Y-%m-%d.%H_%M_%S")
db_name = 'doc_test' + str(timestamp) + '.db'
doc_db = lib_database.PytomoDatabase(db_name)
doc_db.create_pytomo_table('doc_test_table')
doc_db.describe_tables()
row = (datetime.datetime(2011, 5, 6, 15, 30, 50, 103775),
'Youtube', 'http://www.youtube.com/watch?v=RcmKbTR--iA',
'http://v15.lscache3.c.youtube.com',
'173.194.20.56', 'default_10.193.225.12', None, None, None,
8.9944229125976562, 'mp4', 225, 115012833.0, 511168.14666666667,
9575411, 0, 1024, 100, 0.99954795837402344, 7.9875903129577637,
40, 11.722306421319782, 1192528.8804511931,
'http://www.youtube.com/fake_redirect')
doc_db.insert_record(row)
doc_db.fetch_all()
doc_db.fetch_all_parameters(['DownloadTime', 'PingMin', 'PingMax'])
>>> import time
>>> timestamp = time.strftime("%Y-%m-%d.%H_%M_%S")
>>> # to make sure a new file is created for every run we use
>>> # timestamp.
>>> db_name = 'doc_test_lib_db' + str(timestamp) + '.db'
>>> # import pytomo.lib_database as lib_database
>>> doc_db = PytomoDatabase(db_name)
>>> doc_db.create_pytomo_table('doc_test_table')
>>> doc_db.describe_tables()
(u'CREATE TABLE doc_test_table(ID TIMESTAMP,\n Service text,\n Url text,\n CacheUrl text,\n IP text,\n Resolver text,\n PingMin real,\n PingAvg real,\n PingMax real,\n DownloadTime real,\n VideoType text,\n VideoDuration real,\n VideoLength real,\n EncodingRate real,\n DownloadBytes int,\n DownloadInterruptions int,\n InitialData real,\n InitialRate real,\n InitialPlaybacKBuffer real,\n BufferingDuration real,\n PlaybackDuration real,\n BufferDurationAtEnd real,\n MaxInstantThp real,\n RedirectUrl text\n )',)
>>> import datetime
>>> record = (datetime.datetime(2011, 5, 6, 15, 30, 50, 103775),
... 'Youtube', 'http://www.youtube.com/watch?v=RcmKbTR--iA',
... 'http://v15.lscache3.c.youtube.com',
... '173.194.20.56','default_10.193.225.12', None, None, None,
... 8.9944229125976562, 'mp4', 225, 115012833.0, 511168.14666666667,
... 9575411, 0, 1024 ,100, 0.99954795837402344, 7.9875903129577637,
... 35, 11.722306421319782, 1192528.8804511931, None)
>>> doc_db.insert_record(record)
>>> record = (datetime.datetime(2011, 5, 6, 15, 40, 50, 103775),
... 'Youtube', 'http://www.youtube.com/watch?v=RcmKbTR--iA',
... 'http://v15.lscache3.c.youtube.com',
... '173.194.20.56','default_10.193.225.12', None, None, None,
... 8.9944229125976562, 'mp4', 225, 115012833.0, 511168.14666666667,
... 9575411, 0, 1024, 100, 0.99954795837402344, 7.9875903129577637,
... 40, 11.722306421319782, 1192528.8804511931,
... 'http://www.youtube.com/fake_redirect')
>>> doc_db.insert_record(record)
>>> doc_db.fetch_all()
(u'2011-05-06 15:30:50.103775', u'Youtube', u'http://www.youtube.com/watch?v=RcmKbTR--iA', u'http://v15.lscache3.c.youtube.com', u'173.194.20.56', u'default_10.193.225.12', None, None, None, 8.9944229125976562, u'mp4', 225.0, 115012833.0, 511168.14666666667, 9575411, 0, 1024.0, 100.0, 0.99954795837402344, 7.9875903129577637, 35.0, 11.722306421319782, 1192528.8804511931, None)
(u'2011-05-06 15:40:50.103775', u'Youtube', u'http://www.youtube.com/watch?v=RcmKbTR--iA', u'http://v15.lscache3.c.youtube.com', u'173.194.20.56', u'default_10.193.225.12', None, None, None, 8.9944229125976562, u'mp4', 225.0, 115012833.0, 511168.14666666667, 9575411, 0, 1024.0, 100.0, 0.99954795837402344, 7.9875903129577637, 40.0, 11.722306421319782, 1192528.8804511931, u'http://www.youtube.com/fake_redirect')
>>> doc_db.fetch_single_parameter('DownloadTime')
...
[(u'2011-05-06 15:30:50.103775', 8.9944229125976562), (u'2011-05-06 15:40:50.103775', 8.9944229125976562)]
>>> doc_db.fetch_all_parameters(['DownloadTime', 'PingMin', 'PingMax'])
...
[(8.9944229125976562, None, None, u'2011-05-06 15:30:50.103775'), (8.9944229125976562, None, None, u'2011-05-06 15:40:50.103775')]
>>> doc_db.fetch_start_time()
1304688650
>>> from os import unlink
>>> unlink(db_name)
Pytomo database class.
The columns of the file pytomo_table are as follows:
- TID - A timestamped ID generated for each record entered
- Service - The website on which the analysis is performed (example: Youtube, Dailymotion)
- Url - The url of the webpage
- CacheUrl - The url of the cache server hosting the video
- CacheServerDelay - The delay to obtain the cache server url (from the initial web page)
- ResolveTime - The time to get an answer from DNS
- AS - The AS as resolved by RIPE
- PingMin - The minimum recorded ping time to the resolved IP address of the cache server
- VideoDuration - The actual duration of the complete video
- VideoLength - The length (in bytes) of the complete video
- EncodingRate - The encoding rate of the video: VideoLength/VideoDuration
- DownloadBytes - The length of the video sample (in bytes)
- DownloadInterruptions - Number of interruptions experienced during the download
- BufferingDuration - Accumulated time spent in buffering state
- PlaybackDuration - Accumulated time spent in playing state
- BufferDurationAtEnd - The buffer length at the end of download
- TimeTogetFirstByte - Time to get first byte
- MaxInstantThp - The max instantaneous throughput of the download
- RedirectUrl - The redirection url in case of an HTTP redirect
- StatusCode - HTTP return code
Function to return the number of rows in a table. If there are problems related to database integrity, -1 is returned.
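A row count with a -1 fallback could look like the sketch below, written against the standard sqlite3 module. The function name `count_rows_sketch` is hypothetical, and treating every `sqlite3.Error` as a database integrity problem is an assumption; the actual method may be more selective.

```python
import sqlite3


def count_rows_sketch(connection, table):
    """Return the number of rows in table, or -1 on a database error."""
    # Table names cannot be bound as SQL parameters, so the identifier is
    # quoted by hand (double quotes are escaped by doubling them).
    query = 'SELECT COUNT(*) FROM "%s"' % table.replace('"', '""')
    try:
        return connection.execute(query).fetchone()[0]
    except sqlite3.Error:
        # Assumption: any SQLite error is reported as -1.
        return -1
```

Returning a sentinel keeps the caller simple, at the cost of hiding the reason for the failure.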
Function to save (parameter_1, ..., parameter_n, timestamp) in a sorted list of tuples dependent on timestamp
Function to save (timestamp,parameter) in a sorted list of tuples
Function to transform to seconds from epoch a time represented by a string of the form '%Y-%m-%d %H:%M:%S.%f'
>>> time_to_epoch('2012-06-25 14:54:57.422007')
1340628897
>>> time_to_epoch(None)
Traceback (most recent call last):
...
TypeError: expected string or buffer
>>> time_to_epoch('2012-06-25 14:54:57')
1340628897
>>> time_to_epoch('2012-06-25 14:54:57') #doctest: +NORMALIZE_WHITESPACE
Traceback (most recent call last):
...
ValueError: time data '2012-06-25 14:54:57' does not match format '%Y-%m-%d %H:%M:%S.%f'
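The conversion can be sketched with the standard library as follows. This version interprets the string as UTC so that its result does not depend on the machine's timezone; that choice is an assumption and differs from the doctest above, whose value 1340628897 corresponds to 12:54:57 UTC and therefore suggests a local-time conversion on a UTC+2 machine. The name `time_to_epoch_utc` is hypothetical.

```python
import calendar
from datetime import datetime

TIME_FORMAT = '%Y-%m-%d %H:%M:%S.%f'


def time_to_epoch_utc(time_string):
    """Convert a '%Y-%m-%d %H:%M:%S.%f' string, read as UTC, to epoch seconds."""
    # strptime raises TypeError for None and ValueError for a wrong format.
    parsed = datetime.strptime(time_string, TIME_FORMAT)
    # timegm is the UTC counterpart of time.mktime; microseconds are dropped.
    return calendar.timegm(parsed.timetuple())
```

A local-time variant would call `time.mktime` on the same time tuple instead of `calendar.timegm`.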
Module to retrieve the IP address of a URL out of a set of nameservers
Usage: To use the functions provided in this module independently, first place yourself just above the pytomo folder. Then:
import pytomo.start_pytomo as start_pytomo
TIMESTAMP = 'test_timestamp'
start_pytomo.configure_log_file(TIMESTAMP)
import pytomo.lib_dns as lib_dns
url = 'www.example.com'
lib_dns.get_ip_addresses(url)
lib_dns.get_default_name_servers()
Return a list of IP addresses of default name servers
>>> get_default_name_servers()
... # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
'...............'
>>> # Check for string of the format 'x.x.x.x'
Module for I/O operations on filesystem. Used to write the index.html to display the graphical interface. Version: 0.1 Author: Ana Oprea Date: 20.07.2012
Usage:
Bases: pytomo.fpdf.fpdf.FPDF, pytomo.fpdf.html.HTMLMixin
Class to create pdf from html
Computes the arithmetic mean of a list of numbers.
>>> average([20, 30, 70], 3)
40.0
>>> average([], 0)
nan
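The two documented results (40.0 for three values, nan for an empty list) can be reproduced by the sketch below. The name `average_sketch` is hypothetical, and the guard on a zero count is an assumption about how the nan result is obtained.

```python
def average_sketch(values, count):
    """Return the arithmetic mean of values, or nan when count is zero."""
    if count == 0:
        # Avoid a ZeroDivisionError and signal "no data" with nan.
        return float('nan')
    # float() keeps the division exact under Python 2 as well.
    return sum(values) / float(count)
```

Note that nan never compares equal to itself, so callers should test the result with `math.isnan`.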
Verify that all html templates and their plots have been created.
Function to return a tuple (start_crawl_time, end_crawl_time, nr_videos, average_ping, average_download_time, average_download_interruptions)
Function to return, from the path directory, the file for a specific parameter and timestamp, or None.
The filenames are relative to the parent directory.
>>> import os.path
>>> from tempfile import NamedTemporaryFile
>>> from time import time
>>> PARAM = 'DownloadTime'
>>> TIMESTAMP = str(int(time()))
>>> RRD_PLOT_DIR = 'images'
>>> f1 = NamedTemporaryFile(suffix=PARAM, dir=RRD_PLOT_DIR, delete=False)
>>> f2 = NamedTemporaryFile(suffix=TIMESTAMP, dir=RRD_PLOT_DIR,
... delete=False)
>>> f3 = NamedTemporaryFile(suffix=(PARAM + '_' + TIMESTAMP),
... dir=RRD_PLOT_DIR, delete=False)
>>> os.path.basename(f3.name) == os.path.basename(
... get_file_by_param_timestamp(RRD_PLOT_DIR, PARAM, TIMESTAMP))
True
>>> os.path.basename(f2.name) == os.path.basename(
... get_file_by_param_timestamp(RRD_PLOT_DIR, PARAM, TIMESTAMP))
False
>>> os.path.basename(f1.name) == os.path.basename(
... get_file_by_param_timestamp(RRD_PLOT_DIR, PARAM, TIMESTAMP))
False
>>> f1.close()
>>> f2.close()
>>> f3.close()
>>> os.unlink(f1.name)
>>> os.unlink(f2.name)
>>> os.unlink(f3.name)
Function to return the newest file in a path
>>> import os.path
>>> from tempfile import NamedTemporaryFile
>>> f = NamedTemporaryFile(delete=False)
>>> f.name == get_latest_file(os.path.dirname(f.name))
True
>>> f.close()
>>> os.unlink(f.name)
Function to return the newest file in a path whose name contains the include string
>>> import os.path
>>> from tempfile import NamedTemporaryFile
>>> INCLUDE = 'test'
>>> f = NamedTemporaryFile(suffix=INCLUDE, delete=False)
>>> f.name == get_latest_specific_file(os.path.dirname(f.name), INCLUDE)
True
>>> f.close()
>>> os.unlink(f.name)
Function to return all the files in path that contain the include string in their name
>>> import os.path
>>> from tempfile import NamedTemporaryFile
>>> INCLUDE = 'test'
>>> f1 = NamedTemporaryFile(suffix=INCLUDE, delete=False)
>>> f2 = NamedTemporaryFile(suffix=INCLUDE, delete=False)
>>> f3 = NamedTemporaryFile(delete=False)
>>> set([f1.name, f2.name]) == set(
... get_specific_files(os.path.dirname(f1.name), INCLUDE))
True
>>> set([f1.name, f2.name, f3.name]) == set(
... get_specific_files(os.path.dirname(f1.name), INCLUDE))
False
>>> f1.close()
>>> f2.close()
>>> f3.close()
>>> os.unlink(f1.name)
>>> os.unlink(f2.name)
>>> os.unlink(f3.name)
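The file lookups described here can be sketched with `os.listdir` and `os.path.getmtime`. Both function names below are hypothetical, and using the modification time to decide which file is newest is an assumption.

```python
import os


def specific_files_sketch(path, include):
    """Return the full names of files in path whose name contains include."""
    return [os.path.join(path, name) for name in os.listdir(path)
            if include in name and os.path.isfile(os.path.join(path, name))]


def latest_specific_file_sketch(path, include):
    """Return the most recently modified matching file, or None."""
    candidates = specific_files_sketch(path, include)
    if not candidates:
        return None
    # Assumption: "newest" means the largest modification time.
    return max(candidates, key=os.path.getmtime)
```

Building the second helper on the first keeps the name filter in one place.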
Function to return the html containing graphs and their explanation for the *parameters
Return the file name of the index (try to create it if it does not exist). Will have the pattern: <TEMPLATES_DIR>/<hostname>.<timestamp>.<param_TEMPLATE_FILE>
Return the file name of the pdf (try to create it if it does not exist). Will have the pattern: <PDF_DIR>/<hostname>.<timestamp>.<param_PDF_FILE>
Return the file name of the plot (try to create it if it does not exist). Will have the pattern: <RRD_PLOT_DIR>/<hostname>.<timestamp>.<param_IMAGE_FILE>
Return the path to the plot relative to the TEMPLATES_DIR. Will have the pattern: ../<RRD_PLOT_DIR>/<plot_name>
Return the path to the plot relative to the TEMPLATES_DIR. Will have the pattern: <RRD_PLOT_DIR>/<plot_name>
Return the file name of the rrd (try to create it if it does not exist). Will have the pattern: <RRD_DIR>/<hostname>.<timestamp>.<RRD_FILE>
Write the list of databases from db_dir in the html template.
Function to create the parameter_timestamp_index.html from template files and include the images that also contain a specific timestamp.
Function to write the header and contents of the left column - links
Module to retrieve all links from a web page. Usage:
import pytomo.lib_cache_url as lib_cache_url
import pytomo.start_pytomo as start_pytomo
log_file = 'test_cache_url'
start_pytomo.configure_log_file(log_file)
url = 'http://www.youtube.com/charts/videos_views'
all_links = lib_cache_url.get_all_links(url)
Bases: urllib2.Request
Class to return only the header of a request
Bases: htmllib.HTMLParser
Simple HTML parser to obtain the urls from webpage
Parse and return a list of the links from the HTMLParser
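The documented parser derives from htmllib.HTMLParser, which exists only in Python 2. As a rough Python 3 equivalent, the sketch below collects the href attribute of every anchor tag with `html.parser.HTMLParser`. The class and function names are hypothetical and this is not the module's code.

```python
from html.parser import HTMLParser


class LinkParserSketch(HTMLParser):
    """Collect the href value of each <a> tag (hypothetical class)."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


def get_links_from_html(html_text):
    """Return the list of links found in html_text, in document order."""
    parser = LinkParserSketch()
    parser.feed(html_text)
    return parser.links
```

Fetching the page itself is left out so that the example stays independent of the network.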
Module to generate the RTT times of a ping
This module provides two functions that enable us to get the ping statistics of an IP address on any system (Linux, Windows, Mac).
Module to plot the data and generate the PNG/PDF image file
Module for RRDtool interface to the pytomo data.
Version: 0.1 Author: Ana Oprea Date: 11.07.2012
- Usage:
# first create a database - follow steps in lib_database
import pytomo.start_pytomo as start_pytomo
start_pytomo.configure_log_file('doc_test')
import pytomo.lib_database as lib_database
from pytomo.lib_plot import UNITS
import pytomo.lib_rrdtools as lib_rrdtools
pytomo_rrd = lib_rrdtools.PytomoRRD(db_name)
pytomo_rrd.update_pytomo_rrd()
pytomo_rrd.plot_pytomo_rrd()
>>> import time
>>> timestamp = time.strftime("%Y-%m-%d.%H_%M_%S")
>>> # to make sure a new file is created for every run we use
>>> # timestamp.
>>> db_name = 'doc_test_lib_db' + str(timestamp) + '.db'
>>> # import pytomo.lib_database as lib_database
>>> doc_db = lib_database.PytomoDatabase(db_name)
>>> doc_db.create_pytomo_table('doc_test_table')
>>> import datetime
>>> record = (datetime.datetime(2011, 5, 6, 15, 30, 50, 103775),
... 'Youtube', 'http://www.youtube.com/watch?v=RcmKbTR--iA',
... 'http://v15.lscache3.c.youtube.com',
... '173.194.20.56','default_10.193.225.12', None, None, None,
... 8.9944229125976562, 'mp4', 225, 115012833.0, 511168.14666666667,
... 9575411, 0, 1024 ,100, 0.99954795837402344, 7.9875903129577637,
... 35, 11.722306421319782, 1192528.8804511931, None)
>>> doc_db.insert_record(record)
>>> record = (datetime.datetime(2011, 5, 6, 15, 31, 10, 103775),
... 'Youtube', 'http://www.youtube.com/watch?v=RcmKbTR--iA',
... 'http://v15.lscache3.c.youtube.com',
... '173.194.20.56','default_10.193.225.12', None, None, None,
... 8.9944229125976562, 'mp4', 225, 115012833.0, 511168.14666666667,
... 9575411, 0, 1024, 100, 0.99954795837402344, 7.9875903129577637,
... 40, 11.722306421319782, 1192528.8804511931,
... 'http://www.youtube.com/fake_redirect')
>>> doc_db.insert_record(record)
>>> pytomo_rrd = PytomoRRD(db_name)
>>> pytomo_rrd.update_pytomo_rrd()
>>> pytomo_rrd.plot_pytomo_rrd()
>>> from os import unlink
>>> unlink(db_name)
Pytomo class to interact with rrdtools
Function to return a list of elements 'DS:ds-name:GAUGE:heartbeat:U:U'
>>> HEARTBEAT = 100
>>> create_DS_types(['BufferDurationAtEnd', 'PingMin',
... 'InitialData'], HEARTBEAT) #doctest: +NORMALIZE_WHITESPACE
['DS:BufferDurationAtEnd:GAUGE:100:U:U', 'DS:PingMin:GAUGE:100:U:U',
'DS:InitialData:GAUGE:100:U:U']
>>> create_DS_types([], HEARTBEAT)
[]
>>> create_DS_types(None, HEARTBEAT)
Traceback (most recent call last):
...
TypeError: 'NoneType' object is not iterable
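The data source strings shown in this doctest follow a fixed pattern, so one way to build them is a list comprehension over the parameter names. The sketch below is an illustration under that assumption, and `create_ds_types_sketch` is a hypothetical name.

```python
def create_ds_types_sketch(parameters, heartbeat):
    """Build one 'DS:<name>:GAUGE:<heartbeat>:U:U' string per parameter."""
    # Iterating over None raises the TypeError shown in the doctest.
    return ['DS:%s:GAUGE:%d:U:U' % (name, heartbeat) for name in parameters]
```

The two trailing `U` fields leave the minimum and maximum of the data source unknown, which is how the documented strings are written.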
Function to return a list where None arguments are transformed to U
>>> format_null_values(('2012-06-25 14:54:57.422007', 0.0, None, 130048.0,
... None, 4643.9046215020562)) #doctest: +NORMALIZE_WHITESPACE
[('2012-06-25 14:54:57.422007', 0.0, None, 130048.0, None,
4643.9046215020562)]
>>> format_null_values(*('2012-06-25 14:54:57.422007', 0.0, None, 130048.0,
... None, 4643.9046215020562))
['2012-06-25 14:54:57.422007', 0.0, 'U', 130048.0, 'U', 4643.9046215020562]
>>> format_null_values(None)
['U']
>>> format_null_values('2012-06-25 14:54:57.422007',*(0.0, None, 130048.0,
... None, 4643.9046215020562))
['2012-06-25 14:54:57.422007', 0.0, 'U', 130048.0, 'U', 4643.9046215020562]
>>> format_null_values('2012-06-25 14:54:57.422007',(0.0, None, 130048.0,
... None, 4643.9046215020562))
['2012-06-25 14:54:57.422007', (0.0, None, 130048.0, None,
4643.9046215020562)]
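These doctests show that only top-level None arguments are replaced: a tuple passed as a single argument is returned untouched. A variadic function reproduces that behaviour, as in the sketch below. The name `format_null_values_sketch` is hypothetical and the body is inferred from the examples.

```python
def format_null_values_sketch(*args):
    """Return the arguments as a list, with each None replaced by 'U'."""
    # Only the arguments themselves are inspected, not nested containers.
    return ['U' if arg is None else arg for arg in args]
```

To convert the fields of a database row, the row therefore has to be unpacked with `*` at the call site.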
Function to create a list of filenames for plots like: RRD_PLOT_DIR/parameter_to_plot_timestamp.extension
TODO: redo doctest
>>> from time import time
>>> TIMESTAMP = '2012-07-20.11_44_27'
>>> generate_plot_names(['DownloadTime',
... 'PingMin'], TIMESTAMP) #doctest: +NORMALIZE_WHITESPACE, +ELLIPSIS
['/home/capture/co/pytomo/trunk/Pytomo/images/s-spo-hti.2012-07-20.11_44_27.DownloadTime_pytomo_image.png',
'/home/capture/co/pytomo/trunk/Pytomo/images/s-spo-hti.2012-07-20.11_44_27.PingMin_pytomo_image.png']
>>> generate_plot_names([], TIMESTAMP)
[]
>>> generate_plot_names(None, TIMESTAMP)
Traceback (most recent call last):
...
TypeError: 'NoneType' object is not iterable
Escape the : in the filename of a rrd because this is not accepted in the rrd_graph when defining a function (problem appears generally on Windows)
>>> rrd_filename_escape_colon('/home/capture/co/pytomo/trunk/Pytomo/rrds/s-spo-hti.1350291171.pytomo.rrd')
'/home/capture/co/pytomo/trunk/Pytomo/rrds/s-spo-hti.1350291171.pytomo.rrd'
>>> rrd_filename_escape_colon('C:\Pytomo\rrds\s-spo-hti.1350291171.pytomo.rrd')
Function to return a string '%i:%s:...:%s' dependent on the number of DS
>>> update_data_types(['BufferDurationAtEnd', 'PingMin', 'InitialData'])
'%i:%s:%s:%s'
>>> update_data_types([])
'%i'
>>> update_data_types(None)
Traceback (most recent call last):
...
TypeError: object of type 'NoneType' has no len()
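The template string in this doctest starts with `%i` for the timestamp and adds one `:%s` per data source. A sketch that matches the documented results, including the `len()` error on None, is given below; `update_data_types_sketch` is a hypothetical name.

```python
def update_data_types_sketch(parameters):
    """Return '%i' followed by one ':%s' per parameter."""
    # len(None) raises the TypeError shown in the doctest.
    return '%i' + ':%s' * len(parameters)
```

The result can then be filled with a timestamp and the formatted values, for example `update_data_types_sketch(['PingMin']) % (1340628897, 'U')`.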
Returns: A list containing the list of videos.
first place yourself just above the pytomo folder. Then:
import pytomo.start_pytomo as start_pytomo
TIMESTAMP = 'test_timestamp'
start_pytomo.configure_log_file(TIMESTAMP)
import pytomo.lib_youtube_api as lib_youtube_api
time = 'today' # choose from 'today', 'week', 'month' or 'all_time'
max_results = 25
time_frame = lib_youtube_api.get_time_frame(time)
lib_youtube_api.get_popular_links(time_frame, max_results)
url = 'http://www.youtube.com/watch?v=cv5bF2FJQBc'
max_per_page = 25
max_per_url = 10
lib_youtube_api.get_youtube_links(url)
lib_youtube_api.get_related_urls(url, max_per_page, max_per_url)
Returns the most popular Youtube links (world-wide). The number of videos returned is given as Total_pages. (The results are returned in no particular order.) Returns: A set of only Youtube links from url.
Return a set of max_links randomly chosen related urls
Returns the time frame in the form accepted by youtube_api
>>> from . import start_pytomo
>>> start_pytomo.configure_log_file('doc_test') #doctest: +ELLIPSIS
Configuring log file
Logs are there: ...
...
>>> get_time_frame('today')
't'
>>> get_time_frame('week')
'w'
>>> get_time_frame('month')
'm'
>>> get_time_frame('all_time')
'a'
>>> get_time_frame('other')
'a'
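The mapping in this doctest, including the fallback to 'a' for an unknown value, can be expressed as a dictionary lookup with a default. The sketch below is inferred from the examples only; the names `TIME_FRAMES` and `get_time_frame_sketch` are hypothetical, and the logging performed by the real function is omitted.

```python
# Hypothetical table derived from the documented results.
TIME_FRAMES = {'today': 't', 'week': 'w', 'month': 'm', 'all_time': 'a'}


def get_time_frame_sketch(time_frame):
    """Return the one-letter code for time_frame, defaulting to 'a'."""
    # Unknown values fall back to the all-time code, as in the doctest.
    return TIME_FRAMES.get(time_frame, 'a')
```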
Return a set of only Youtube links from url
Return the interesting part of a Youtube url
>>> url = 'http://www.youtube.com/watch?v=hE0207sxaPg&feature=hp_SLN&list=SL'
>>> trunk_url(url) #doctest: +NORMALIZE_WHITESPACE
'http://www.youtube.com/watch?v=hE0207sxaPg'
>>> url = 'http://www.youtube.com/watch?v=y2kEx5BLoC4&feature=list_related&playnext=1&list=MLGxdCwVVULXfxx-61LMYHbwpcwAvZd-rI'
>>> trunk_url(url) #doctest: +NORMALIZE_WHITESPACE
>>> url = 'http://www.youtube.com/watch?v=UC-RFFIMXlA'
>>> trunk_url(url)
'http://www.youtube.com/watch?v=UC-RFFIMXlA'
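Judging from the examples, the interesting part of a Youtube url is the watch address with only its `v` parameter. A Python 3 sketch using `urllib.parse` is shown below. The name `trunk_url_sketch` is hypothetical, and returning the url unchanged when no `v` parameter is present is an assumption.

```python
from urllib.parse import parse_qs, urlsplit


def trunk_url_sketch(url):
    """Keep only the scheme, host, path and 'v' parameter of a url."""
    parts = urlsplit(url)
    video_ids = parse_qs(parts.query).get('v')
    if not video_ids:
        # Assumption: urls without a video id are returned as they are.
        return url
    return '%s://%s%s?v=%s' % (parts.scheme, parts.netloc, parts.path,
                               video_ids[0])
```

Parsing the query string instead of cutting at the first `&` also handles urls where `v` is not the first parameter.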
Module to download youtube video for a limited amount of time and calculate the data downloaded within that time
- Usage:
- This module provides two classes: FileDownloader class and the InfoExtractor class. This module is not meant to be called directly.
Bases: pytomo.lib_general_download.InfoExtractor
Information extractor for youtube.com.
Decide which formats to download with req_format (default is best quality). Return video url list.
Returns True if the URL is suitable for this IE, else False.
>>> yie = YoutubeIE(InfoExtractor)
>>> yie.suitable('http://www.youtube.com/watch?v=rERIxeYOYhI')
True
>>> yie.suitable('http://www.youtube.com')
False
>>> yie.suitable('http://www.youtube.com/watch?v=-VB2dHVNyds&')
True
>>> yie.suitable('http://www.youtube.com/watch?')
False
>>> yie.suitable('http://youtu.be/3VdOTTfSKyM')
True
Module to launch a crawl. This module supplies the following functions that can be used independently:
- compute_stats: To calculate the download statistics of a URL.
- Usage:
- To use the functions provided in this module independently, first place yourself just above the pytomo folder. Then:
import pytomo.start_pytomo as start_pytomo
import pytomo.config_pytomo as config_pytomo
config_pytomo.LOG_FILE = '-'
import time
timestamp = time.strftime('%Y-%m-%d.%H_%M_%S')
log_file = start_pytomo.configure_log_file(timestamp)
import platform
config_pytomo.SYSTEM = platform.system()
url = 'http://youtu.be/3VdOTTfSKyM'
start_pytomo.compute_stats(url)
# test Dailymotion
url = 'http://www.dailymotion.com/video/xscdm4_le-losc-au-pays-basque_sport?no_track=1'

import pytomo.start_pytomo as start_pytomo
import pytomo.config_pytomo as config_pytomo
config_pytomo.LOG_FILE = '-'
import time
timestamp = time.strftime('%Y-%m-%d.%H_%M_%S')
log_file = start_pytomo.configure_log_file(timestamp)
import platform
config_pytomo.SYSTEM = platform.system()
# video delivered by akamai CDN
url = 'http://www.dailymotion.com/video/xp9fq9_test-video-akamai_tech'
start_pytomo.compute_stats(url)
# redirect url: does not work
url = 'http://vid.ak.dmcdn.net/video/986/034/42430689_mp4_h264_aac.mp4?primaryToken=1343398942_d77027d09aac0c5d5de74d5428fb9e5b'
start_pytomo.compute_stats(url, redirect=True)
# video delivered by edgecast CDN
url = 'http://www.dailymotion.com/video/xmcyww_test-video-cell-edgecast_tech'
start_pytomo.compute_stats(url)
url = 'http://vid.ec.dmcdn.net/cdn/H264-512x384/video/xmcyww.mp4?77838fedd64fa52abe6a11b3bdbb4e62f4387ebf7cbce2147ea4becc5eee5c418aaa6598bb98a61fc95a02997247e59bfb0dcd58cdf05c1601ded04f75ae357b225da725baad5e97ea6cce6d6a12e17d1c01'
start_pytomo.compute_stats(url, redirect=True)
# video delivered by dailymotion servers
url = 'http://www.dailymotion.com/video/xmcyw2_test-video-cell-core_tech'
start_pytomo.compute_stats(url)
url = 'http://proxy-60.dailymotion.com/video/246/655/37556642_mp4_h264_aac.mp4?auth=1343399602-4098-bdkyfgul-eb00ad223e1964e40b327d75367b273b'
start_pytomo.compute_stats(url, redirect=True)
Bases: exceptions.Exception
Class to stop crawling when the max nb of urls has been attained
Bases: exceptions.Exception
Class to generate timeout exceptions
Insert the stats in the db and update the crawled urls
Check if the urls should be fully downloaded
Return the full path of the file used for the output. Test if the path exists; create it if possible, or else create it in the default user directory.
>>> file_pattern = None
>>> directory = 'logs'
>>> timestamp = 'doc_test'
>>> check_out_files(file_pattern, directory, timestamp)
>>> file_pattern = 'pytomo.log'
>>> check_out_files(file_pattern, directory, timestamp)
'...doc_test.pytomo.log'
Return a list of the download statistics related to the cache_uri
Return a list of the statistics related to the url
Set the timeout if the OS supports it. Return a bool indicating if signal is supported.
Configure the log file and indicate success or failure
Convert the string passed to a logging level
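With the standard logging module, a level name can be turned into its numeric level by looking up the upper-cased name as a module attribute. The sketch below shows that approach; the function name `to_logging_level` and the choice to raise ValueError for unknown names are assumptions, not the module's documented behaviour.

```python
import logging


def to_logging_level(level_name):
    """Return the numeric logging level named by level_name."""
    # logging.DEBUG, logging.INFO, etc. are integer module attributes.
    level = getattr(logging, level_name.upper(), None)
    if not isinstance(level, int):
        # Assumption: unknown names are rejected instead of defaulted.
        raise ValueError('Unknown logging level: %r' % level_name)
    return level
```

The result can be passed directly to `logging.getLogger().setLevel`.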
Crawl the link and return the next urls
Wrapper to crawl each input link
Crawls the urls given by the url_file until max_rounds have been performed or max_visited_urls has been reached
Perform the rounds of crawl
Return the stats as a list of tuples to insert into the database
>>> stats = ('http://www.youtube.com/watch?v=RcmKbTR--iA',
... 'http://v15.lscache3.c.youtube.com',
... {'173.194.20.56': [datetime.datetime(
... 2011, 5, 6, 15, 30, 50, 103775),
... None,
... [8.9944229125976562, 'mp4',
... 225,
... 115012833.0,
... 511168.14666666667,
... 9575411,
... 0,
... 0.99954795837402344,
... 7.9875903129577637,
... 11.722306421319782,
... 1192528.8804511931, 15169],
... None, 'default_10.193.225.12']})
>>> format_stats(stats)
[(datetime.datetime(2011, 5, 6, 15, 30, 50, 103775),
'Youtube', 'http://www.youtube.com/watch?v=RcmKbTR--iA',
'http://v15.lscache3.c.youtube.com', '173.194.20.56',
'default_10.193.225.12', 15169, None, None, None, 8.9944229125976562,
'mp4', 225, 115012833.0, 511168.14666666667, 9575411, 0,
0.99954795837402344, 7.9875903129577637, 11.722306421319782,
1192528.8804511931, None)]
>>> stats = ('http://www.youtube.com/watch?v=OdF-oiaICZI',
... 'http://v7.lscache8.c.youtube.com',
... {'74.125.105.226': [datetime.datetime(
... 2011, 5, 6, 15, 30, 50, 103775),
... [26.0, 196.0, 82.0],
... [30.311000108718872, 'mp4',
... 287.487, 16840065.0,
... 58576.78781997099,
... 1967199, 0,
... 1.316999912261963,
... 28.986000061035156,
... 5.542251416248594,
... 1109.4598961624772, 15169],
... 'http://www.youtube.com/fake_redirect',
... 'google_public_dns_8.8.8.8_open_dns_208.67.220.220'],
... '173.194.8.226': [datetime.datetime(2011, 5, 6, 15,
... 30, 51, 103775),
... [103.0, 108.0, 105.0],
... [30.287999868392944, 'mp4',
... 287.487, 16840065.0,
... 58576.78781997099,
... 2307716,
... 0,
... 1.3849999904632568,
... 28.89300012588501,
... 11.47842453761781,
... 32770.37517215069, 15169],
... None, 'default_212.234.161.118']})
>>> format_stats(stats)
[(datetime.datetime(2011, 5, 6, 15, 30, 50, 103775),
'Youtube', 'http://www.youtube.com/watch?v=OdF-oiaICZI',
'http://v7.lscache8.c.youtube.com', '74.125.105.226',
'google_public_dns_8.8.8.8_open_dns_208.67.220.220', 15169, 26.0, 196.0, 82.0,
30.311000108718872, 'mp4', 287.48700000000002, 16840065.0,
58576.787819970988, 1967199, 0, 1.3169999122619629,
28.986000061035156, 5.5422514162485941, 1109.4598961624772,
'http://www.youtube.com/fake_redirect'),
(datetime.datetime(2011, 5, 6, 15, 30, 51, 103775),
'Youtube', 'http://www.youtube.com/watch?v=OdF-oiaICZI',
'http://v7.lscache8.c.youtube.com', '173.194.8.226',
'default_212.234.161.118', 103.0, 108.0, 105.0, 30.287999868392944,
'mp4', 287.48700000000002, 16840065.0, 58576.787819970988, 2307716,
0, 1.3849999904632568, 28.89300012588501, 11.47842453761781,
32770.375172150692, None)]
Return a tuple of the set of input urls and a set of related urls of videos. Arguments:
- input_links: list of the urls
- max_per_url and max_per_page options
- out_file_name: if provided, the list is dumped in it
Computes and stores the md5 hash of result and database files
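Computing the md5 hash of a file can be sketched with hashlib as below. Only the hashing step is shown; where the documented function stores the result is not covered. The name `md5_of_file_sketch` and the chunk size are assumptions.

```python
import hashlib


def md5_of_file_sketch(file_name, chunk_size=65536):
    """Return the hexadecimal md5 digest of the file's contents."""
    digest = hashlib.md5()
    with open(file_name, 'rb') as in_file:
        # Read in chunks so that large database files fit in memory.
        for chunk in iter(lambda: in_file.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

Reading in binary mode matters here, since the database file is not text.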
Get and log the provider from the user, or skip after timeout seconds
Function to prompt the user to enter max_urls
Function to prompt for provider
Function to prompt the user to enter the proxies it uses to connect to the internet
Return the list of cache url servers for a given video. The last element is the server from which the actual video is downloaded.
Return the libraries to use for downloading and retrieving specific links
Sets the max number of videos to be crawled
Convert the proxy passed to a dict to be handled by urllib2
Note
External module included: webpy (http://webpy.org/)
>>> # call the class from top level
>>> start_server.py
Class that serves the main page. Will search for a .html file under the folder set in render below.
Class that serves the PDF reports. Will search for elements under the directories mentioned in urls related to this class.