theta_etc: musings and solutions to interesting problems, ones that I think are easy but then have a lot of structure. By ClimberT8.

trying to embed pdf (2016-08-25)

Introductory text or abstract, then<br />
to read more:<br />
Here is the <a href="https://plus.google.com/113068923602287446137/posts/YRWaMJ8yQNH" target="_blank">doc</a><br />
<br />
so the steps are:<br />
upload the pdf to Drive<br />
change its sharing settings to public<br />
copy the link<br />
then in Blogger: new post, compose, insert the link and whatever text<br />
then publish; it will appear on Google+.<br />
<br />
Check: as a reader, clicking the link takes you to the pdf on the owner's Google+ or Google Docs page; clicking yet again on that leads to the full pdf.<br />
<br />
If this is what Google does with simple pdf publication, how can we trust them to design a self-driving car?

Video CPM as a Function of Demand - Complete (2015-06-07)
<html>
<title>Video CPM as a function of Demand</title>
<script type="text/javascript">
function round(value, decimals) {
// Shift the decimal point via exponent notation before rounding,
// which avoids float artifacts like Math.round(1.005*100) === 100.
return Number(Math.round(value+'e'+decimals)+'e-'+decimals);
}
function dailySpend() {
var annualSpend = document.calcForm1.annualSpend.value;
var dailySpendVar = '';
dailySpendVar = Math.round(1000*annualSpend/365.);
document.calcForm1.dailySpendVar.value = dailySpendVar;
console.log("Annual Spend ($Million) = " + annualSpend);
console.log("Daily Spend ($1K) = " + dailySpendVar);
// debugger;
return false;
}
function dailyCost() {
var annualSpend = document.calcForm1.annualSpend.value;
var margin = document.calcForm2.margin.value;
var dailyCostVar = '';
dailyCostVar = Math.round((1-margin/100.)*(1000*annualSpend/365.)) ;
document.calcForm2.dailyCostVar.value = dailyCostVar;
console.log("Margin (%) = " + margin);
console.log("Daily Cost ($1K) = " + dailyCostVar);
return false;
}
function actualDailyCost() {
var annualSpend = document.calcForm1.annualSpend.value;
var margin = document.calcForm2.margin.value;
var costFraction = document.calcForm3.costFraction.value;
var spendFraction = document.calcForm3.spendFraction.value;
var dailyCost = '';
var actualDailyCostVar = '';
dailyCost = (1-margin/100.)*(1000*annualSpend/365.) ;
actualDailyCostVar = Math.round((spendFraction/100 + (1-spendFraction/100)*costFraction/100)*dailyCost) ;
document.calcForm3.actualDailyCostVar.value = actualDailyCostVar;
console.log("Current Cost Fraction on Tiered Inventory (%) = " + costFraction);
console.log("Expected Spend Fraction on Tiered Inventory (%) = " + spendFraction);
console.log("Actual Daily Cost on Tiered Inventory ($1K) = " + actualDailyCostVar);
return false;
}
function CPM() {
var annualSpend = document.calcForm1.annualSpend.value;
var margin = document.calcForm2.margin.value;
var costFraction = document.calcForm3.costFraction.value;
var spendFraction = document.calcForm3.spendFraction.value;
var CPM10K = document.calcForm4.CPM10K.value;
var exponent = document.calcForm4.exponent.value;
var dailyCost = '';
var actualDailyCost = '';
var costCPM = '';
var revenueCPM = '';
dailyCost = (1-margin/100.)*(1000*annualSpend/365.) ;
actualDailyCost = (spendFraction/100 + (1-spendFraction/100)*costFraction/100)*dailyCost ;
costCPM = round(CPM10K * Math.pow(actualDailyCost/10.,exponent), 2)
revenueCPM = round(CPM10K * Math.pow(actualDailyCost/10.,exponent)/(1-margin/100.), 2)
document.calcForm4.costCPM.value = costCPM;
document.calcForm4.revenueCPM.value = revenueCPM;
console.log("CPM at $10K daily cost = " + CPM10K);
console.log("elasticity = " + exponent);
console.log("Cost Based CPM = " + costCPM);
console.log("Revenue CPM = " + revenueCPM);
return false;
}
</script><br />
<h1>
Experiment description and Analysis</h1>
<h1>
<span style="font-size: small;"><span style="font-weight: normal;">The 13 biggest client tactics, representing about 9% of total video revenue, were duplicated with all prior restrictions (geo, site, device, etc.), additionally restricted to Tier I-III inventory, and budget-density tested at 1X, 2X, 6X and 12X daily budget on 25, 12, 4 and 2 user buckets respectively. </span></span></h1>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGwXGyjCDWwl1Q5kx6ohGODBwCRelZq0donYhwAKlcb9yJBZr8bBUTb8pcy6uz9xydnkUMKTFi8QsWGIBZt_7X_gWDNedaqudEw5FXPnMdWt_P-p5byKbNvSU9wC2gncwOTsDtGgd9cj4L/s1600/BudgetTable.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGwXGyjCDWwl1Q5kx6ohGODBwCRelZq0donYhwAKlcb9yJBZr8bBUTb8pcy6uz9xydnkUMKTFi8QsWGIBZt_7X_gWDNedaqudEw5FXPnMdWt_P-p5byKbNvSU9wC2gncwOTsDtGgd9cj4L/s320/BudgetTable.jpg" /></a></div>
<h1>
<span style="font-size: small;"><span style="font-weight: normal;">For each of these budget points we have the spend, cost and impressions in the TierI-III inventory.</span></span></h1>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgRAJKH7xUOYVr7R19uZ2jDrvapILKSVS7JS7aLQU_6PccUpcYDNfIAZGdtOQI4-RkuEeQY_HzkLowKouUhjMUC07N4I5zpMyJLNUG0NVnoj3LgZvlCXLdm-E99WWI_BC2I1XK9wJhm5XUI/s1600/TierI-III_data.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgRAJKH7xUOYVr7R19uZ2jDrvapILKSVS7JS7aLQU_6PccUpcYDNfIAZGdtOQI4-RkuEeQY_HzkLowKouUhjMUC07N4I5zpMyJLNUG0NVnoj3LgZvlCXLdm-E99WWI_BC2I1XK9wJhm5XUI/s320/TierI-III_data.jpg" /></a></div>
<h1>
<span style="font-size: small;"><span style="font-weight: normal;">
From other sources (Roni's presentation) we know that about 15% of all our impressions (50% of all avails and 20% of cost) are in Tier I-III, </span></span></h1>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgAPoaNazxLyqhbINGvll-BuoiOAlKQEILrwwIzGSgMJ4cUZ7dbnCEPys7d04kHyXckcGmMcb0xyO13xPO07icL7s29zE3LUANv_dDdArQ9PRt1pSUkpOlzaP6kfFPr78hzwHEBt1OnkSma/s1600/TierTree.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgAPoaNazxLyqhbINGvll-BuoiOAlKQEILrwwIzGSgMJ4cUZ7dbnCEPys7d04kHyXckcGmMcb0xyO13xPO07icL7s29zE3LUANv_dDdArQ9PRt1pSUkpOlzaP6kfFPr78hzwHEBt1OnkSma/s320/TierTree.jpg" /></a></div>
<h1>
<span style="font-size: small;"><span style="font-weight: normal;">so for all the video tactics which weren't restricted to Tier I-III as part of this test, we uniformly apportioned 20% of spend and cost (assuming the margin was the same) and 15% of impressions in the tested user buckets to Tier I-III inventory.</span></span></h1>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhNs787ArjVb_ZEdMQYIowou9APKMRvOblXNqT61X8ixQqUsQYGXG2-s89UIJgwYGF_EfSfTK-orcGL4_fHSFmGhayJZ2MQGceyZZ-uuxp8mzXxtknR9coSc5IO4MhQ4nxuZx6j0beNuyKn/s1600/non-test_restricted_bkts_data.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhNs787ArjVb_ZEdMQYIowou9APKMRvOblXNqT61X8ixQqUsQYGXG2-s89UIJgwYGF_EfSfTK-orcGL4_fHSFmGhayJZ2MQGceyZZ-uuxp8mzXxtknR9coSc5IO4MhQ4nxuZx6j0beNuyKn/s320/non-test_restricted_bkts_data.jpg" /></a></div>
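The apportionment amounts to scaling each untested tactic's bucket totals by the tier shares quoted above; a minimal sketch (the function and field names are illustrative, not from the original analysis):

```javascript
// Attribute a Tier I-III share of an unrestricted tactic's daily totals,
// per the text: 20% of spend and cost, 15% of impressions.
// (Function and field names are mine, not from the original data.)
function apportionTierShare(row) {
  return {
    spend: 0.20 * row.spend,             // 20% of spend
    cost: 0.20 * row.cost,               // 20% of cost (same margin assumed)
    impressions: 0.15 * row.impressions  // 15% of impressions
  };
}

// e.g. a hypothetical tactic spending $100K/day at $60K cost, 2M impressions:
console.log(apportionTierShare({ spend: 100, cost: 60, impressions: 2000000 }));
```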
<h1>
<span style="font-size: small;"><span style="font-weight: normal;">
The above spend, cost and impressions were added to those for the tested tactics, </span></span></h1>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgMcxrXa5p1JtmfR_GX4-gczCD0lfz84p6vnT0_wmk1Aj96oAeEYrOh-u27hnol9qDIdfYopFEQ4ZjDWRpsAtgfnSKk858LSKJR28yXP9k7f1GZe14Y1KGFL2vZ0vWb5waCKI-eBbG7AzxV/s1600/restricted_bkts_all_data.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgMcxrXa5p1JtmfR_GX4-gczCD0lfz84p6vnT0_wmk1Aj96oAeEYrOh-u27hnol9qDIdfYopFEQ4ZjDWRpsAtgfnSKk858LSKJR28yXP9k7f1GZe14Y1KGFL2vZ0vWb5waCKI-eBbG7AzxV/s320/restricted_bkts_all_data.jpg" /></a></div>
<h1>
<span style="font-size: small;"><span style="font-weight: normal;">and CPM vs daily cost was plotted and fitted to a power law. </span></span></h1>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg7MdUBSrdQSRjGxSKHP5aiPY3MRDzGd0n1HqP3SWDtn0N0yKl6XCkJGJcopxFsRH6dvWQje5WfnyaYQsqmkJAuysiZDhQdEGgkAgVwJUfxhXn4lTIALkciP1_KcM-4E2BsbecZAiad4RAo/s1600/CPM_elasticity_cost.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg7MdUBSrdQSRjGxSKHP5aiPY3MRDzGd0n1HqP3SWDtn0N0yKl6XCkJGJcopxFsRH6dvWQje5WfnyaYQsqmkJAuysiZDhQdEGgkAgVwJUfxhXn4lTIALkciP1_KcM-4E2BsbecZAiad4RAo/s320/CPM_elasticity_cost.jpg" /></a></div>
<h1>
<span style="font-size: small;"><span style="font-weight: normal;">
Why a power law? Because the elasticity of price (CPM) with respect to demand (total cost) is the ratio of their logarithmic derivatives, which for a power law is just the constant exponent. </span></span></h1>
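In symbols, with A the fitted coefficient and &epsilon; the fitted exponent (the calculator's CPM10K and exponent fields):

```latex
\mathrm{CPM}(c) = A\,c^{\epsilon}
\quad\Longrightarrow\quad
\frac{d\ln \mathrm{CPM}}{d\ln c}
  = \frac{c}{\mathrm{CPM}}\,\frac{d\,\mathrm{CPM}}{dc}
  = \epsilon \quad\text{(constant)},
```

so fitting log(CPM) against log(daily cost) with a straight line gives the elasticity directly as the slope.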
<h1>
<span style="font-size: small;"><span style="font-weight: normal;">
dailySpend ($1K) = 1000 * annualSpend ($Million) / 365
</span></span></h1>
<br />
<form action="" name="calcForm1" onsubmit="return dailySpend();">
<label>Enter Annual Spend in all Video ($Million) = <input name="annualSpend" value="80" /></label><br />
<input type="submit" value="Calculate Daily Spend" /><br />
<label>Daily Spend in all Video ($1K) = <input name="dailySpendVar" value="" /></label></form>
<br />
<h1>
<span style="font-size: small;"><span style="font-weight: normal;">
dailyCost = (1-margin)*dailySpend
</span></span></h1>
<br />
<form action="" name="calcForm2" onsubmit="return dailyCost();">
<label>Enter Margin (%) = <input name="margin" value="40" /></label><br />
<input type="submit" value="Calculate Daily Cost" /><br />
<label>Daily Cost in all Video ($1K) = <input name="dailyCostVar" value="" /></label><br />
</form>
<h1>
<span style="font-size: small;"><span style="font-weight: normal;">
actualDailyCost = (spendFraction + (1-spendFraction)*costFraction) * dailyCost
</span></span></h1>
<br />
<form action="" name="calcForm3" onsubmit="return actualDailyCost();">
<label>Enter Current Tier I-III Inventory Cost fraction (%) = <input name="costFraction" value="20" /></label><br />
<label>Enter Fraction of Video spend in Tier restricted campaigns (%) = <input name="spendFraction" value="100" /></label><br />
<input type="submit" value="Calculate ..." /><br />
<label>Actual Daily Cost in Tier I-III Video ($1K) = <input name="actualDailyCostVar" value="" /></label><br />
</form>
<h1>
<span style="font-size: small;"><span style="font-weight: normal;">
Cost Based CPM from fit to elasticity curve evaluated at actualDailyCost <br />
Revenue CPM = CostCPM/(1-margin)
</span></span></h1>
<br />
<form action="" name="calcForm4" onsubmit="return CPM();">
<label>Enter Cost based CPM at $10K Daily cost in TierI-III = <input name="CPM10K" value="3.7" /></label><br />
<label>Enter Elasticity of CBCPM vs Cost, exponent in power law = <input name="exponent" value="0.15" /></label><br />
<input type="submit" value="Calculate CostCPM and Revenue CPM" /><br />
<label>Cost Based CPM ($) = <input name="costCPM" value="" /></label><br />
<label>Revenue CPM ($) = <input name="revenueCPM" value="" /></label><br />
<br />
<h1>
Effect of Geo, Site, Device and DealID restrictions</h1>
<div>
<h1>
<span style="font-size: small;"><span style="font-weight: normal;">DealID restrictions were in effect for all 13 line_items, so their effect cannot be measured.</span></span></h1>
</div>
<div>
<span style="font-size: small;"><span style="font-weight: normal;">All restrictions data:</span></span></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghudvDX3xDr_FL-lCo6a_2aRWpViHI58zw-ubElwxP6vay86CKGwtSjHYZDgQp7t9W5PyxApMG5QCEJTgBA62hEF7SrW4WDPv8lRqlWynl7v2lMp480cKJE3Z7Mve4oHx2-xRbWUGUCpQm/s1600/restrictions_data.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghudvDX3xDr_FL-lCo6a_2aRWpViHI58zw-ubElwxP6vay86CKGwtSjHYZDgQp7t9W5PyxApMG5QCEJTgBA62hEF7SrW4WDPv8lRqlWynl7v2lMp480cKJE3Z7Mve4oHx2-xRbWUGUCpQm/s320/restrictions_data.jpg" /></a></div>
<div>
<span style="font-size: small;"><span style="font-weight: normal;">As one can see from the following plots, none of the other three restrictions, which were variously present or absent in the tested line_items, shows any significant effect on either the coefficient (CPM at $10K, a multiplicative effect) or the elasticity (the exponent, an additive effect).</span></span></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhVblws7cr_u_7OkK5WtfXfKg5IqbpSIl5LLgAWYl7MC1HJcl83sZPKGORpIGMmcpIyE022BTZZQU5OOU0yPb8vjBs8CbnNMeXqim62eTeBT-yjIiZEJSP-6fMpzaT4FKAJ-SCLqOw_8W3T/s1600/geoPlot.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhVblws7cr_u_7OkK5WtfXfKg5IqbpSIl5LLgAWYl7MC1HJcl83sZPKGORpIGMmcpIyE022BTZZQU5OOU0yPb8vjBs8CbnNMeXqim62eTeBT-yjIiZEJSP-6fMpzaT4FKAJ-SCLqOw_8W3T/s320/geoPlot.jpg" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVYuDh6W-WDo2rTz39b5DsG005ELY1HehYCFSNfRDvMZNG0uekrW1h1735_IeEglUvpGwpEuPCIAT7avHpy_QRY-EAWUrImsYjLzPT8X8MhyphenhypheniBjFaZjZKmOQbU-IyxJnw_8yO-x7oEtHXT/s1600/SitePlot.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVYuDh6W-WDo2rTz39b5DsG005ELY1HehYCFSNfRDvMZNG0uekrW1h1735_IeEglUvpGwpEuPCIAT7avHpy_QRY-EAWUrImsYjLzPT8X8MhyphenhypheniBjFaZjZKmOQbU-IyxJnw_8yO-x7oEtHXT/s320/SitePlot.jpg" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhn77qf87Oi7rGJzJ3OUHpdKly5M46k3Gd_hbeLaboyB9mG4QRm-LULPIhrCEdLlJe0BBgfigNqFzNpOEByFKw3fBglhyThtUYh3p58eXX86O5mpO5s8ppt-w2wi2AVm4MQKUh7pxPtHsUe/s1600/DevicePlot.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhn77qf87Oi7rGJzJ3OUHpdKly5M46k3Gd_hbeLaboyB9mG4QRm-LULPIhrCEdLlJe0BBgfigNqFzNpOEByFKw3fBglhyThtUYh3p58eXX86O5mpO5s8ppt-w2wi2AVm4MQKUh7pxPtHsUe/s320/DevicePlot.jpg" /></a></div>
<div>
<span style="font-size: small;"><span style="font-weight: normal;"><br /></span></span></div>
</form>
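Putting the four form handlers together: a standalone sketch of the same arithmetic (no DOM needed), run with the page's default inputs. The function name cpmPipeline is mine; the formulas mirror the handlers above.

```javascript
// Minimal standalone sketch of the calculator pipeline above.
function round(value, decimals) {
  return Number(Math.round(value + 'e' + decimals) + 'e-' + decimals);
}

function cpmPipeline(annualSpendM, marginPct, costFracPct, spendFracPct, cpm10K, exponent) {
  var dailySpend = 1000 * annualSpendM / 365;                  // $1K/day
  var dailyCost = (1 - marginPct / 100) * dailySpend;          // after margin
  var actualDailyCost =
    (spendFracPct / 100 + (1 - spendFracPct / 100) * costFracPct / 100) * dailyCost;
  var raw = cpm10K * Math.pow(actualDailyCost / 10, exponent); // power-law fit
  return {
    dailySpend: Math.round(dailySpend),
    dailyCost: Math.round(dailyCost),
    actualDailyCost: Math.round(actualDailyCost),
    costCPM: round(raw, 2),
    revenueCPM: round(raw / (1 - marginPct / 100), 2)
  };
}

// Page defaults: $80M annual spend, 40% margin, 20% cost fraction,
// 100% spend fraction, $3.70 CPM at $10K daily cost, elasticity 0.15.
console.log(cpmPipeline(80, 40, 20, 100, 3.7, 0.15));
// → { dailySpend: 219, dailyCost: 132, actualDailyCost: 132,
//     costCPM: 5.45, revenueCPM: 9.08 }
```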
</html>
ClickMePushYou (2015-06-03)
<!DOCTYPE html>
<html>
<head>
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.2/jquery.min.js"></script>
<script>
$(document).ready(function(){
$("p").click(function(){
$(this).next().hide();
});
});
</script>
</head>
<body>
<h2>The Puzzle</h2>
<p><i>If you click on </i>me...</p>
<p><tt><em>I</em> will disappear.</tt></p>
<p><b>Try clicking <em>me</em> away!</b></p>
<h3>No, dooon't!</h3>
<h2>How did you do?</h2>
<p>Your score = 6 - (number of clicks + page_refreshes)</p>
</body>
</html>
Player size restricted Video CPM as a function of Demand (2015-06-02)
<title>Player size restricted Video CPM as a function of Demand</title>
<script type="text/javascript">
function round(value, decimals) {
return Number(Math.round(value+'e'+decimals)+'e-'+decimals);
}
function dailySpend() {
var annualSpend = document.calcForm1.annualSpend.value;
var dailySpendVar = '';
dailySpendVar = Math.round(1000*annualSpend/365.);
document.calcForm1.dailySpendVar.value = dailySpendVar;
console.log("Annual Spend ($Million) = " + annualSpend);
console.log("Daily Spend ($1K) = " + dailySpendVar);
// debugger;
return false;
}
function dailyCost() {
var annualSpend = document.calcForm1.annualSpend.value;
var margin = document.calcForm2.margin.value;
var dailyCostVar = '';
dailyCostVar = Math.round((1-margin/100.)*(1000*annualSpend/365.)) ;
document.calcForm2.dailyCostVar.value = dailyCostVar;
console.log("Margin (%) = " + margin);
console.log("Daily Cost ($1K) = " + dailyCostVar);
return false;
}
function actualDailyCost() {
var annualSpend = document.calcForm1.annualSpend.value;
var margin = document.calcForm2.margin.value;
var costFraction = document.calcForm3.costFraction.value;
var spendFraction = document.calcForm3.spendFraction.value;
var dailyCost = '';
var actualDailyCostVar = '';
dailyCost = (1-margin/100.)*(1000*annualSpend/365.) ;
actualDailyCostVar = Math.round((spendFraction/100 + (1-spendFraction/100)*costFraction/100)*dailyCost) ;
document.calcForm3.actualDailyCostVar.value = actualDailyCostVar;
console.log("Current Cost Fraction on PlayerSize > 400X300 Inventory (%) = " + costFraction);
console.log("Expected Spend Fraction on PlayerSize > 400X300 Inventory (%) = " + spendFraction);
console.log("Actual Daily Cost on PlayerSize > 400X300 Inventory ($1K) = " + actualDailyCostVar);
return false;
}
function CPM() {
var annualSpend = document.calcForm1.annualSpend.value;
var margin = document.calcForm2.margin.value;
var costFraction = document.calcForm3.costFraction.value;
var spendFraction = document.calcForm3.spendFraction.value;
var CPM10K = document.calcForm4.CPM10K.value;
var exponent = document.calcForm4.exponent.value;
var dailyCost = '';
var actualDailyCost = '';
var costCPM = '';
var revenueCPM = '';
dailyCost = (1-margin/100.)*(1000*annualSpend/365.) ;
actualDailyCost = (spendFraction/100 + (1-spendFraction/100)*costFraction/100)*dailyCost ;
costCPM = round(CPM10K * Math.pow(actualDailyCost/10.,exponent), 2)
revenueCPM = round(CPM10K * Math.pow(actualDailyCost/10.,exponent)/(1-margin/100.), 2)
document.calcForm4.costCPM.value = costCPM;
document.calcForm4.revenueCPM.value = revenueCPM;
console.log("CPM at $10K daily cost = " + CPM10K);
console.log("elasticity = " + exponent);
console.log("Cost Based CPM = " + costCPM);
console.log("Revenue CPM = " + revenueCPM);
return false;
}
</script><br />
<h1>
Experiment description and Analysis</h1>
<h1>
<span style="font-size: large;"><span style="font-weight: normal;">The biggest client tactics, representing about 9% of total video revenue, were duplicated with all prior restrictions (geo, site, device, etc.), additionally restricted to Player size > 400 X 300, and budget-density tested at 1X, 2X, 6X and 12X daily budget on 25, 12, 4 and 2 user buckets respectively.
For each of these budget points we have the spend, cost and impressions in the player size inventory.</span></span></h1>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhsnKNeo8WFVRIV8ueVAfrFlasoCgTSyyRcFP8BqsLMvPKSemqDsKHa1hqYwVOD03aQZTy75BUoTlFNTtsRr799JLpyLEqxCYpOfy1tOumOVhTiaz2kGi03CFcZUqCxh5olL5fHgjkxS7VK/s1600/player_restricted.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="205" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhsnKNeo8WFVRIV8ueVAfrFlasoCgTSyyRcFP8BqsLMvPKSemqDsKHa1hqYwVOD03aQZTy75BUoTlFNTtsRr799JLpyLEqxCYpOfy1tOumOVhTiaz2kGi03CFcZUqCxh5olL5fHgjkxS7VK/s320/player_restricted.jpg" width="320" /></a></div>
<h1>
<span style="font-size: large;"><span style="font-weight: normal;">We are clearly having trouble delivering at scale at the higher budget densities - we are delivering only about 35%. For example, to hit the intended budget-density points, every tactic in a line item (targeting different numbers of user buckets) should have had the same cost, yet on 5/19/15 the cost in "2 buckets" is only 129 instead of 379. Hence, while the goal was to explore a range up to 12X the current total video revenue, in practice we've only been able to get to about 4X (12X × 129/379 ≈ 4X). </span></span></h1>
<h1>
<span style="font-size: large;"><span style="font-weight: normal;">
From querying our database</span></span></h1>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZQbQJ05YwGrdJsfYalRx09-Q07UJr4oX3Y9mo1ur0zWm7Y9nKEu6lOJO-HPcEL5EaOWa9HrIIGKFL0Vqej4RH_HMEnYjYJrSizYPTFkJfrKFyTwUkJhRq492V0wVv0wi1pRVC-Psw3pnI/s1600/ad_width_fraction.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZQbQJ05YwGrdJsfYalRx09-Q07UJr4oX3Y9mo1ur0zWm7Y9nKEu6lOJO-HPcEL5EaOWa9HrIIGKFL0Vqej4RH_HMEnYjYJrSizYPTFkJfrKFyTwUkJhRq492V0wVv0wi1pRVC-Psw3pnI/s320/ad_width_fraction.jpg" width="163" /></a></div>
<h1>
<span style="font-size: large;"><span style="font-weight: normal;"><br /></span></span></h1>
<h1>
<span style="font-size: large;"><span style="font-weight: normal;">we know that about 35% of all our impressions and 35% of video cost are for player size > 400X300, so for all the video tactics which weren't restricted by player size as part of this test, we uniformly apportioned 35% of spend and cost (assuming the margin was the same) and 35% of impressions in the tested user buckets to player-size-restricted inventory.</span></span></h1>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdQsXf25P2kn0DwwEKDGx_oDDhFmnTb-pVQzA_2HN8XqxfTKn-6zY8T8GRsMuGbqt5Fy-ubwG9Ro1PMIyMfHEEdgUZLza73lPFC3VQfz9m6LAzGWKATjd8DFlb8RSSchSJME8dqte_JkVi/s1600/non-test_tactics_bkts.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="117" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdQsXf25P2kn0DwwEKDGx_oDDhFmnTb-pVQzA_2HN8XqxfTKn-6zY8T8GRsMuGbqt5Fy-ubwG9Ro1PMIyMfHEEdgUZLza73lPFC3VQfz9m6LAzGWKATjd8DFlb8RSSchSJME8dqte_JkVi/s320/non-test_tactics_bkts.jpg" width="320" /></a></div>
<div>
<span style="font-size: large;"><span style="font-weight: normal;"><br /></span></span></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<h1>
<span style="font-size: large;"><span style="font-weight: normal;">
The above spend, cost and impressions were added to those for the tested tactics, </span></span></h1>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhykJyFpd9YH7by_HFtPMH_zX33zzSiEQIRcigTFgYJ-GLgt_NAEzOUpkGpcV6KUjNCv1qwGgWx33EKpaaICg-tIvJC0j5o53zIlM1LcbUdTxZBbgYbXgmQdpvdR73Zd5P7CuPGmBYGcDqm/s1600/player_restricted_all.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="93" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhykJyFpd9YH7by_HFtPMH_zX33zzSiEQIRcigTFgYJ-GLgt_NAEzOUpkGpcV6KUjNCv1qwGgWx33EKpaaICg-tIvJC0j5o53zIlM1LcbUdTxZBbgYbXgmQdpvdR73Zd5P7CuPGmBYGcDqm/s320/player_restricted_all.jpg" width="320" /></a></div>
<div>
<span style="font-size: large;"><span style="font-weight: normal;"><br /></span></span></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<h1>
<span style="font-size: large;"><span style="font-weight: normal;">and CPM vs daily cost was plotted and fitted to a power law. </span></span></h1>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6VXVSsNW2Fobu7KQbS2mkH2uRKUWns0hot5QrCj4nkccdD1K-grQHqODSOmss4bTISFFTqe61jxBJc28S2Acv2axp8cryY7jCFCFs2oFtU1RO-AAU-pNuxrJ5NE615Q0XL9iYJfnb8D_3/s1600/plot.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6VXVSsNW2Fobu7KQbS2mkH2uRKUWns0hot5QrCj4nkccdD1K-grQHqODSOmss4bTISFFTqe61jxBJc28S2Acv2axp8cryY7jCFCFs2oFtU1RO-AAU-pNuxrJ5NE615Q0XL9iYJfnb8D_3/s320/plot.jpg" width="320" /></a></div>
<div>
<span style="font-size: large;"><span style="font-weight: normal;"><br /></span></span></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<h1>
<span style="font-size: large;"><span style="font-weight: normal;">
Why a power law? Because the elasticity of price (CPM) with respect to demand (total cost) is the ratio of their logarithmic derivatives, which for a power law is just the constant exponent. </span></span></h1>
<h1>
The Calculator</h1>
<br />
<form action="" name="calcForm1" onsubmit="return dailySpend();">
<label>Enter Annual Spend in all Video ($Million) = <input name="annualSpend" value="80" /></label><br />
<input type="submit" value="Calculate Daily Spend" /><br />
<label>Daily Spend in all Video ($1K) = <input name="dailySpendVar" value="" /></label></form>
<br />
<h1>
<span style="font-size: large;"><span style="font-weight: normal;">
dailyCost = (1-margin)*dailySpend
</span></span></h1>
<br />
<form action="" name="calcForm2" onsubmit="return dailyCost();">
<label>Enter Margin (%) = <input name="margin" value="40" /></label><br />
<input type="submit" value="Calculate Daily Cost" /><br />
<label>Daily Cost in all Video ($1K) = <input name="dailyCostVar" value="" /></label><br />
<br /></form>
<h1>
<span style="font-size: large;"><span style="font-weight: normal;">
actualDailyCost = (spendFraction + (1-spendFraction)*costFraction) * dailyCost
</span></span></h1>
<br />
<form action="" name="calcForm3" onsubmit="return actualDailyCost();">
<label>Enter Current PlayerSize > 400X300 Inventory Cost fraction (%) = <input name="costFraction" value="35" /></label><br />
<label>Enter Fraction of Video spend in PlayerSize > 400X300 restricted campaigns (%) = <input name="spendFraction" value="100" /></label><br />
<input type="submit" value="Calculate ..." /><br />
<label>Actual Daily Cost in PlayerSize > 400X300 Video ($1K) = <input name="actualDailyCostVar" value="" /></label><br />
<br /></form>
<h1>
<span style="font-size: large;"><span style="font-weight: normal;">
Cost Based CPM from fit to elasticity curve evaluated at actualDailyCost <br />
Revenue CPM = CostCPM/(1-margin)
</span></span></h1>
<br />
<form action="" name="calcForm4" onsubmit="return CPM();">
<label>Enter Cost based CPM at $10K Daily cost in PlayerSize > 400X300 = <input name="CPM10K" value="4.4" /></label><br />
<label>Enter Elasticity of CBCPM vs Cost, exponent in power law = <input name="exponent" value="0.44" /></label><br />
<input type="submit" value="Calculate CostCPM and Revenue CPM" /><br />
<label>Cost Based CPM ($) = <input name="costCPM" value="" /></label><br />
<label>Revenue CPM ($) = <input name="revenueCPM" value="" /></label><br />
<br />
<h1>
</h1>
</form>
Annual Video spend to Cost CPM calculator (2015-04-27)
<title>Video CPM as a function of Demand</title>
<script type="text/javascript">
function round(value, decimals) {
return Number(Math.round(value+'e'+decimals)+'e-'+decimals);
}
function dailySpend() {
var annualSpend = document.calcForm1.annualSpend.value;
var dailySpendVar = '';
dailySpendVar = Math.round(1000*annualSpend/365.);
document.calcForm1.dailySpendVar.value = dailySpendVar;
console.log("Annual Spend ($Million) = " + annualSpend);
console.log("Daily Spend ($1K) = " + dailySpendVar);
// debugger;
return false;
}
function dailyCost() {
var annualSpend = document.calcForm1.annualSpend.value;
var margin = document.calcForm2.margin.value;
var dailyCostVar = '';
dailyCostVar = Math.round((1-margin/100.)*(1000*annualSpend/365.)) ;
document.calcForm2.dailyCostVar.value = dailyCostVar;
console.log("Margin (%) = " + margin);
console.log("Daily Cost ($1K) = " + dailyCostVar);
return false;
}
function actualDailyCost() {
var annualSpend = document.calcForm1.annualSpend.value;
var margin = document.calcForm2.margin.value;
var costFraction = document.calcForm3.costFraction.value;
var spendFraction = document.calcForm3.spendFraction.value;
var dailyCost = '';
var actualDailyCostVar = '';
dailyCost = (1-margin/100.)*(1000*annualSpend/365.) ;
actualDailyCostVar = Math.round((spendFraction/100 + (1-spendFraction/100)*costFraction/100)*dailyCost) ;
document.calcForm3.actualDailyCostVar.value = actualDailyCostVar;
console.log("Current Cost Fraction on Tiered Inventory (%) = " + costFraction);
console.log("Expected Spend Fraction on Tiered Inventory (%) = " + spendFraction);
console.log("Actual Daily Cost on Tiered Inventory ($1K) = " + actualDailyCostVar);
return false;
}
function CPM() {
var annualSpend = document.calcForm1.annualSpend.value;
var margin = document.calcForm2.margin.value;
var costFraction = document.calcForm3.costFraction.value;
var spendFraction = document.calcForm3.spendFraction.value;
var CPM10K = document.calcForm4.CPM10K.value;
var exponent = document.calcForm4.exponent.value;
var dailyCost = '';
var actualDailyCost = '';
var costCPM = '';
var revenueCPM = '';
dailyCost = (1-margin/100.)*(1000*annualSpend/365.) ;
actualDailyCost = (spendFraction/100 + (1-spendFraction/100)*costFraction/100)*dailyCost ;
costCPM = round(CPM10K * Math.pow(actualDailyCost/10.,exponent), 2)
revenueCPM = round(CPM10K * Math.pow(actualDailyCost/10.,exponent)/(1-margin/100.), 2)
document.calcForm4.costCPM.value = costCPM;
document.calcForm4.revenueCPM.value = revenueCPM;
console.log("CPM at $10K daily cost = " + CPM10K);
console.log("elasticity = " + exponent);
console.log("Cost Based CPM = " + costCPM);
console.log("Revenue CPM = " + revenueCPM);
return false;
}
</script><br />
<h1>
Experiment description and Analysis</h1>
<h1>
<span style="font-size: large;"><span style="font-weight: normal;">The 13 biggest client tactics, representing about 9% of total video revenue, were duplicated with all prior restrictions (geo, site, device, etc.), additionally restricted to Tier I-III inventory, and budget-density tested at 1X, 2X, 6X and 12X daily budget on 25, 12, 4 and 2 user buckets respectively.
For each of these budget points we have the spend, cost and impressions in the TierI-III inventory.</span></span></h1>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1YCaVhjLQ5PoLRF9hoXIPSFPjbkbEoPlx4vTsMgnjfX2kxZagq5r1wpyqblFvKLm66HUPgxkdvNSR8411YhkbZIwD-rnpNlnUJSeRKK-BOrluy8FvuGJYE_TZ88q53o64ih9Dm3atbqri/s1600/TierI-III_data.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1YCaVhjLQ5PoLRF9hoXIPSFPjbkbEoPlx4vTsMgnjfX2kxZagq5r1wpyqblFvKLm66HUPgxkdvNSR8411YhkbZIwD-rnpNlnUJSeRKK-BOrluy8FvuGJYE_TZ88q53o64ih9Dm3atbqri/s1600/TierI-III_data.jpg" height="196" width="320" /></a></div>
<h1>
<span style="font-size: large;"><span style="font-weight: normal;">
From other sources (Atul's presentation) we know that about 15% of all our impressions (50% of all avails) are in Tier I-III, so for all the video tactics which weren't restricted to Tier I-III as part of this test, we uniformly apportioned 20% of spend and cost (assuming the margin was the same) and 15% of impressions in the tested user buckets to Tier I-III inventory.</span></span></h1>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgqSfbFwtsogtlDnQhII38NySDamlenjeTjRzpExEWoREQHw6hIZnXnDN4U5MMuFLM8b9AXjzCqCvETF2_e8guF12GI9WqQfwPim2EednMYbNHiujN54Ru8GqZMCNJo8Mf3Iqtmu2JbokMZ/s1600/non-test_restricted_bkts_data.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgqSfbFwtsogtlDnQhII38NySDamlenjeTjRzpExEWoREQHw6hIZnXnDN4U5MMuFLM8b9AXjzCqCvETF2_e8guF12GI9WqQfwPim2EednMYbNHiujN54Ru8GqZMCNJo8Mf3Iqtmu2JbokMZ/s1600/non-test_restricted_bkts_data.jpg" height="158" width="400" /></a></div>
<h1>
<span style="font-size: large;"><span style="font-weight: normal;">
The above spend, cost and impressions were added to those for the tested tactics, </span></span></h1>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjuNzcX6z1J0lfAeMHpVJ6RpQYaSFRaYSnPQ67x40Kil-LXzLk9OSQ4xACu0npNh2VPB7eR8-RZCaID79yAfbLdhmtC1SgB9SxgmvSBaP81PlrqOlkRGcCxJA5dTyr2BcbwSu88coTQkyCc/s1600/restricted_bkts_all_data.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjuNzcX6z1J0lfAeMHpVJ6RpQYaSFRaYSnPQ67x40Kil-LXzLk9OSQ4xACu0npNh2VPB7eR8-RZCaID79yAfbLdhmtC1SgB9SxgmvSBaP81PlrqOlkRGcCxJA5dTyr2BcbwSu88coTQkyCc/s1600/restricted_bkts_all_data.jpg" height="148" width="640" /></a></div>
<h1>
<span style="font-size: large;"><span style="font-weight: normal;">and CPM vs daily cost was plotted and fitted to a power law. </span></span></h1>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhh2QUhxaYjzQjMaKKcSjYAtgPjFSy7Lr4_4MTpdkkCZ2MgEq2IIIyboVswZc_lVEX8UJM1SGLVaQu_MNvYHjvom0JY1N1N-lpU9iKZY7WXiCPbVgJKFcCgVB9rW9pRUx9LCv5CaIz3-4h6/s1600/CPM_elasticity_cost.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhh2QUhxaYjzQjMaKKcSjYAtgPjFSy7Lr4_4MTpdkkCZ2MgEq2IIIyboVswZc_lVEX8UJM1SGLVaQu_MNvYHjvom0JY1N1N-lpU9iKZY7WXiCPbVgJKFcCgVB9rW9pRUx9LCv5CaIz3-4h6/s1600/CPM_elasticity_cost.jpg" height="186" width="320" /></a></div>
<h1>
<span style="font-size: large;"><span style="font-weight: normal;">
Why a power law? Because cost (CPM) elasticity of demand (total cost) is the ratio of the logarithmic derivatives, which is just the exponent in a power law. </span></span></h1>
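This can be checked numerically: for a power law CPM(c) = CPM10K * (c/10)^k, the ratio of logarithmic derivatives d log(CPM)/d log(c) is the constant exponent k, whatever the cost. A quick sketch using the page's default fit values (3.7 and 0.15):

```python
import math

def cpm(daily_cost, cpm_10k=3.7, exponent=0.15):
    """Power-law fit: CPM as a function of daily cost ($1K), normalized at $10K."""
    return cpm_10k * (daily_cost / 10.0) ** exponent

# Elasticity = d(log CPM)/d(log cost), estimated with a finite difference:
c, h = 25.0, 1e-6
elasticity = (math.log(cpm(c * (1.0 + h))) - math.log(cpm(c))) / math.log(1.0 + h)
print(round(elasticity, 4))  # 0.15 -- the exponent, independent of c
```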
<h1>
<span style="font-size: large;"><span style="font-weight: normal;">dailySpend = annualSpend/365
</span></span></h1>
<br />
<form action="" name="calcForm1" onsubmit="return dailySpend();">
<label>Enter Annual Spend in all Video ($Million) = <input name="annualSpend" value="80" /></label><br />
<input type="submit" value="Calculate Daily Spend" /><br />
<label>Daily Spend in all Video ($1K) = <input name="dailySpendVar" value="" /></label></form>
<br />
<h1>
<span style="font-size: large;"><span style="font-weight: normal;">
dailyCost = (1-margin)*dailySpend
</span></span></h1>
<br />
<form action="" name="calcForm2" onsubmit="return dailyCost();">
<label>Enter Margin (%) = <input name="margin" value="40" /></label><br />
<input type="submit" value="Calculate Daily Cost" /><br />
<label>Daily Cost in all Video ($1K) = <input name="dailyCostVar" value="" /></label><br />
<br /></form>
<h1>
<span style="font-size: large;"><span style="font-weight: normal;">
actualDailyCost = (spendFraction + (1-spendFraction)*costFraction) * dailyCost
</span></span></h1>
<br />
<form action="" name="calcForm3" onsubmit="return actualDailyCost();">
<label>Enter Current Tier I-III Inventory Cost fraction (%) = <input name="costFraction" value="20" /></label><br />
<label>Enter Fraction of Video spend in Tier restricted campaigns (%) = <input name="spendFraction" value="100" /></label><br />
<input type="submit" value="Calculate ..." /><br />
<label>Actual Daily Cost in Tier I-III Video ($1K) = <input name="actualDailyCostVar" value="" /></label><br />
<br /></form>
<h1>
<span style="font-size: large;"><span style="font-weight: normal;">
costCPM = CPM10K * (actualDailyCost/10)^exponent, the fit to the elasticity curve evaluated at actualDailyCost <br />
revenueCPM = costCPM/(1-margin)
</span></span></h1>
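The whole calculator chain above can be sketched in Python. This is a transcription of the page's JavaScript logic using the forms' default values; the rounding of intermediate values differs slightly from the on-page calculator:

```python
def video_cpm(annual_spend_mm=80.0, margin_pct=40.0, cost_fraction_pct=20.0,
              spend_fraction_pct=100.0, cpm_10k=3.7, exponent=0.15):
    """Chain the four form calculations into one function."""
    daily_spend = 1000.0 * annual_spend_mm / 365.0            # $1K per day
    daily_cost = (1.0 - margin_pct / 100.0) * daily_spend     # after margin
    spend_frac = spend_fraction_pct / 100.0
    cost_frac = cost_fraction_pct / 100.0
    # Spend in Tier-restricted campaigns counts fully; the rest contributes
    # only its Tier I-III cost fraction.
    actual_daily_cost = (spend_frac + (1.0 - spend_frac) * cost_frac) * daily_cost
    cost_cpm = cpm_10k * (actual_daily_cost / 10.0) ** exponent
    revenue_cpm = cost_cpm / (1.0 - margin_pct / 100.0)
    return round(cost_cpm, 2), round(revenue_cpm, 2)

print(video_cpm())  # (5.45, 9.08) with the form defaults
```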
<br />
<form action="" name="calcForm4" onsubmit="return CPM();">
<label>Enter Cost based CPM at $10K Daily cost in TierI-III = <input name="CPM10K" value="3.7" /></label><br />
<label>Enter Elasticity of CBCPM vs Cost, exponent in power law = <input name="exponent" value="0.15" /></label><br />
<input type="submit" value="Calculate CostCPM and Revenue CPM" /><br />
<label>Cost Based CPM ($) = <input name="costCPM" value="" /></label><br />
<label>Revenue CPM ($) = <input name="revenueCPM" value="" /></label><br />
<br />
<h1>
Effect of Geo, Site, Device and DealID restrictions</h1>
<div>
<h1>
<span style="font-size: large;"><span style="font-weight: normal;">DealID restrictions were in effect for all 13 line_items, so their effect cannot be measured.</span></span></h1>
</div>
<div>
<span style="font-size: large;"><span style="font-weight: normal;">All restrictions data:</span></span></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgB4S20MKX5ovFN62ijPzMwDY12dG6Q8Yx70DQfMcNJNrPRMAJ6OyFhsGA_vYBsnt2nBjXDNAsQzDaCrywxQS5A4XNtExnz_7xDl6lJegL9L7COJg4x2VvHz-V1wThdLp-C4_qX6IaInHv3/s1600/restrictions_data.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgB4S20MKX5ovFN62ijPzMwDY12dG6Q8Yx70DQfMcNJNrPRMAJ6OyFhsGA_vYBsnt2nBjXDNAsQzDaCrywxQS5A4XNtExnz_7xDl6lJegL9L7COJg4x2VvHz-V1wThdLp-C4_qX6IaInHv3/s1600/restrictions_data.jpg" height="140" width="640" /></a></div>
<div>
<span style="font-size: large;"><span style="font-weight: normal;">As one can see from the following plots, none of the other three restrictions, which were variously present/absent in the tested line_items, shows any significant effect on either the coefficient (CPM at $10K, multiplicative effect) or the elasticity (exponent, additive effect).</span></span></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiyg9GUzsS3yXWaxU615d4-vsxMEMwWimGs0mJljls4jOI9xaadKmsxOUMIIa4rGyBDX7XIC7ky6sMp_xZFlo3Lc6dxjvjJISjhGc14v9p1ZTPJJ2eMYQ0Q9VRNFnHEBkxQXCsGcpc-kRWj/s1600/geoPlot.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiyg9GUzsS3yXWaxU615d4-vsxMEMwWimGs0mJljls4jOI9xaadKmsxOUMIIa4rGyBDX7XIC7ky6sMp_xZFlo3Lc6dxjvjJISjhGc14v9p1ZTPJJ2eMYQ0Q9VRNFnHEBkxQXCsGcpc-kRWj/s1600/geoPlot.jpg" height="195" width="320" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiEvY0JgDm6Tj6IEyIcOzQKFBjNudvk3IU4bW87zB7L1K6Bq6kbf78K3GJ_z7Y3aw7gzD68GVMsw9_luf10cgFWLNFNzXkgUTDiAY278wJttw1NZx6vgazVsLrlxh5JZgq_YnpVfN9Ri3i7/s1600/SitePlot.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiEvY0JgDm6Tj6IEyIcOzQKFBjNudvk3IU4bW87zB7L1K6Bq6kbf78K3GJ_z7Y3aw7gzD68GVMsw9_luf10cgFWLNFNzXkgUTDiAY278wJttw1NZx6vgazVsLrlxh5JZgq_YnpVfN9Ri3i7/s1600/SitePlot.jpg" height="194" width="320" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgNuAC8YH5LoJCNkXhiJYk8jdjPREflFcmLmJnYnjLfizLJCWljM8gbLnlKSrak3L_2E4pBb2uockAFhi4owb617CnfvYtR7CPHhuWB98zlJIeGlZ3wWG6XIUHC4ju3nwOL8_9MSv4PNpy9/s1600/DevicePlot.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgNuAC8YH5LoJCNkXhiJYk8jdjPREflFcmLmJnYnjLfizLJCWljM8gbLnlKSrak3L_2E4pBb2uockAFhi4owb617CnfvYtR7CPHhuWB98zlJIeGlZ3wWG6XIUHC4ju3nwOL8_9MSv4PNpy9/s1600/DevicePlot.jpg" height="193" width="320" /></a></div>
<div>
<span style="font-size: large;"><span style="font-weight: normal;"><br /></span></span></div>
</form>
ClimberT8http://www.blogger.com/profile/12302281220153714545noreply@blogger.com0tag:blogger.com,1999:blog-6519535535434149951.post-25418204713158419682012-03-30T12:52:00.000-07:002012-03-30T12:52:38.378-07:00NTA7:Dimensional Sparseness Or What Tokens To Use?NTA7: Dimensional Sparseness Or What
Tokens To Use?<br />
<br />
<a href="http://tatetech.blogspot.com/2012/03/guide-to-posts-on-text-analysis.html" target="_blank">Guide to all Numerical Text Analysis posts</a><br />
<br />
<a href="http://tatetech.blogspot.com/2012/03/constructing-histograms-for-corpus.html" target="_blank">Word histogram and Zipf plot of corpus for identifying stop words</a><br />
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
After
</div>
<ul>
<li><div style="margin-bottom: 0in;">
lowercasing,
</div>
</li>
<li><div style="margin-bottom: 0in;">
depunctuating,
</div>
</li>
<li><div style="margin-bottom: 0in;">
removing single-atom (single-character) words,
</div>
</li>
<li><div style="margin-bottom: 0in;">
removing single occurrence tokens,
</div>
</li>
<li><div style="margin-bottom: 0in;">
stemming and removing a few stop
words (whether the 125 I have chosen or the ~450 in standard lists),</div>
</li>
</ul>
<div style="margin-bottom: 0in;">
we are left with about 14000 tokens
from a bit less than 100 documents.
</div>
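The bulleted pipeline above can be sketched end to end. The stop list, the toy suffix-stripping "stemmer" (a stand-in for a real one such as Porter's), and the two sample documents are all placeholders for illustration:

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "is", "are", "and", "of", "to"}  # placeholder stop list

def crude_stem(token):
    """Toy stand-in for a real stemmer: strip a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def tokenize(text):
    words = re.findall(r"[a-z]+", text.lower())   # lowercase and depunctuate
    words = [w for w in words if len(w) > 1]      # drop single-atom words
    return [crude_stem(w) for w in words if w not in STOP_WORDS]

docs = ["The cats are sleeping.", "A cat sleeps; the dog barks!"]
corpus = Counter()
for doc in docs:
    corpus.update(tokenize(doc))
# finally, drop the single-occurrence tokens from the corpus histogram
corpus = Counter({t: n for t, n in corpus.items() if n > 1})
print(corpus)  # Counter({'cat': 2, 'sleep': 2})
```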
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
Here is the Zipf plot:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZjGVADvOSIQ5HZRKM1jQH4B-ZAx95TJ8UtPO2VzMHY3DP9ZYPrxo9PYQSgJ2kjxAe3RNoRbo32f0opjjosQoCHq6c8DWdUreViheW6_295V0YmS0cws6JTeir6wg7SPDniznvqpdK8D02/s1600/UseAll.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="468" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZjGVADvOSIQ5HZRKM1jQH4B-ZAx95TJ8UtPO2VzMHY3DP9ZYPrxo9PYQSgJ2kjxAe3RNoRbo32f0opjjosQoCHq6c8DWdUreViheW6_295V0YmS0cws6JTeir6wg7SPDniznvqpdK8D02/s640/UseAll.tiff" width="640" /></a></div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
Before going on to the Cosine
Similarity business, let's think about word-vector space. Each token
represents a dimension, and its frequency of occurrence is the
coordinate. Each document represents a point in this space. So what
we have is a 14,000-dimensional space for 100 points. That's
ridiculous! But even if we had a million documents, that would
still be only about 70 points per dimension!
</div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
Now of course this is going to vary
from case to case; e.g., the categorization problem I worked on had
25,000 points in 50,000 dimensions. Hardly a situation where standard
clustering algorithms, most of which are designed for denser
data (at least dimensionally speaking), can be expected to apply. My
solution to this is on SlideShare, and I'll blog about it at length
later.</div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
My point, as was my point for selecting
stop words, is that some [statistical or numerical] and purpose-based
criteria have to be used to decide which set of tokens to use. If you
look at the Zipf plot above, in my view there are clearly three
families of tokens:
</div>
<ul>
<li><div style="margin-bottom: 0in;">
the high ranking ones that
approximate a Zipf line with a coefficient marginally less than 1</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgrY6h3gZzruf7Bj6wzfRSiYSG1AuRBGwrXRrCKzkuDEX0MzKneeEEUpGHe42_rtvpeZb9Cesjq5FOmyinEz1hLUFSeWlHI92OjH8UyL_XWKmMgfDxhrFwOCWa8SZyALJ77SjysxlUdFPLa/s1600/UseHigh.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="291" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgrY6h3gZzruf7Bj6wzfRSiYSG1AuRBGwrXRrCKzkuDEX0MzKneeEEUpGHe42_rtvpeZb9Cesjq5FOmyinEz1hLUFSeWlHI92OjH8UyL_XWKmMgfDxhrFwOCWa8SZyALJ77SjysxlUdFPLa/s400/UseHigh.tiff" width="400" /></a></div>
<div style="margin-bottom: 0in;">
<br /></div>
</li>
<li><div style="margin-bottom: 0in;">
the mid ranking tokens at the knee
of the plot which fall off the Zipf line</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidoAkSLDfD165Q2MDxMUps-0TZyYPKJTDoQtqseY6cXfjn1Nmlp8k9a2DJLB9sum8OkPSOb4TUDEtWdwEvk8KEZ1txM4tfUFk19gC94bmLfL9hIYz_UxpKy_ZeFiCJm7lhGnfMJVKKchVt/s1600/UseLow.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="285" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidoAkSLDfD165Q2MDxMUps-0TZyYPKJTDoQtqseY6cXfjn1Nmlp8k9a2DJLB9sum8OkPSOb4TUDEtWdwEvk8KEZ1txM4tfUFk19gC94bmLfL9hIYz_UxpKy_ZeFiCJm7lhGnfMJVKKchVt/s400/UseLow.tiff" width="400" /></a></div>
<div style="margin-bottom: 0in;">
<br /></div>
</li>
<li><div style="margin-bottom: 0in;">
the low ranking tokens whose power
law deviates very strongly from 1</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgaZ7Ur0bgla5Akt-2rOMjmxSBqTGvu8UY9Vfbzk7zYI2p-U-6GbXjfzRK_iQpp0E7OZVTPYDQ5SA_OwN1EP3XZ5rOl39KF-JaPquGLw0RAj3cmJkD5FrOv5xP6AaMWV6oJPk49YgBFcw50/s1600/UseMid.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="281" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgaZ7Ur0bgla5Akt-2rOMjmxSBqTGvu8UY9Vfbzk7zYI2p-U-6GbXjfzRK_iQpp0E7OZVTPYDQ5SA_OwN1EP3XZ5rOl39KF-JaPquGLw0RAj3cmJkD5FrOv5xP6AaMWV6oJPk49YgBFcw50/s400/UseMid.tiff" width="400" /></a></div>
<div style="margin-bottom: 0in;">
<br /></div>
</li>
</ul>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
My choices for the cutoffs are somewhat
arbitrary, but unimportant for now. Summarizing,
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhBSipKLzXHa1TVQ_k5pVFHdAudMdSQwOnwqAykHgTcXRL_eMkUgKVbfT8DkueS9IQjKcNZd8I7UXl6I5QpOTAPAUa4WGn8F57AT1T-6c6fLpQgw9SUHYNZ0LcXAMvDhfeZMwoiZ1n0MqOE/s1600/UseSummary.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="108" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhBSipKLzXHa1TVQ_k5pVFHdAudMdSQwOnwqAykHgTcXRL_eMkUgKVbfT8DkueS9IQjKcNZd8I7UXl6I5QpOTAPAUa4WGn8F57AT1T-6c6fLpQgw9SUHYNZ0LcXAMvDhfeZMwoiZ1n0MqOE/s640/UseSummary.tiff" width="640" /></a></div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
What do these mean? Consider the
extremely rare tokens that occur only once in the corpus, and hence in
only one document. They are extremely powerful for distinguishing
one document from another: a reasonably small set of them, plus a
few tokens that occur a few times (since a few documents may
contain no singly occurring tokens), will be sufficient to label
every document uniquely. However, precisely because of their rarity,
we can't use them for finding any similarities between texts. So
these low-ranking tokens lead to a very fine graining of the set of
documents. Rejecting them has the advantage of reducing the
dimensionality of word space by an order of magnitude.</div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
What about the high-ranking ones? They
suffer from the converse problem: they occur so often in so many
texts that they are unlikely to be able to distinguish well between
texts at all, which is sort of the motivation for throwing out the
stop words in the first place. (Though my suspicion is that stop
words are, or should be, chosen based on the smallness of the
standard deviation of their relative frequencies.) So short of
calculating these frequency standard deviations for the tokens, which
I hope to get to soon, we should reject these stochastic tokens,
the ones that closely follow Zipf's law (or Mandelbrot's
modification). Rejecting these will <i>not</i><span style="font-style: normal;">
reduce the dimensionality of word-vector space by much, but it </span><span style="font-style: normal;"><b>will</b></span><span style="font-style: normal;"><span style="font-weight: normal;">
reduce the radial distance of the document-points from the origin,
the </span></span><i><span style="font-weight: normal;">de facto</span></i><span style="font-style: normal;"><span style="font-weight: normal;">
center for the Cosine Similarity.</span></span></div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
<span style="font-style: normal;"><span style="font-weight: normal;">So
we can use only 1500 mid-ranking tokens (that is “only” 15
dimensions per point, which is better than 144 dimensions per point)
for the next steps. </span></span>
</div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
<span style="font-style: normal;"><span style="font-weight: normal;">Last
point: before calculating Cosine Similarities, a standard step is to
scale all the token frequencies by the log of the document
frequency for the token. This is a flat metric on wordspace. This
will have the effect of bringing down the weight at the high-ranking
end, and since log(f) increases slower than f, not scaling down as
much the low ranking end. So the overall weights of the tokens may
show precisely the bump in the mid-range that we want. So another
point for comparison. </span></span>
</div>
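The scaling described here is essentially the standard tf-idf weighting; in its usual form, each token frequency is multiplied by log(N/df), the log of the <i>inverse</i> document frequency, which is what down-weights the high-ranking, ubiquitous tokens. A minimal sketch with placeholder documents:

```python
import math
from collections import Counter

def tfidf(doc_histograms):
    """Multiply each token frequency by log(N/df), so tokens present in
    many documents are down-weighted."""
    n_docs = len(doc_histograms)
    df = Counter()
    for hist in doc_histograms:
        df.update(hist.keys())   # document frequency of each token
    return [{t: f * math.log(n_docs / df[t]) for t, f in hist.items()}
            for hist in doc_histograms]

docs = [Counter(a=3, b=1), Counter(a=2, c=5)]
weighted = tfidf(docs)
print(weighted[0]["a"])  # 0.0 -- "a" occurs in every document, log(2/2) = 0
```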
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
So many choices, so little time.</div>
<div style="margin-bottom: 0in;">
<br /></div>ClimberT8http://www.blogger.com/profile/12302281220153714545noreply@blogger.com0tag:blogger.com,1999:blog-6519535535434149951.post-8913808865781553722012-03-29T09:20:00.000-07:002012-03-29T11:28:55.179-07:00NTA6: Constructing Histograms For A Corpus<a href="http://tatetech.blogspot.com/2012/03/wordhistogramsorhamletquestion1.html" target="_blank">Constructing Histograms</a> For A Corpus : Numerical Text Analysis 6<br />
<br />
<a href="http://tatetech.blogspot.com/2012/03/guide-to-posts-on-text-analysis.html" target="_blank">Guide to all Numerical Text Analysis Posts </a><br />
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
For the corpus, or the set of documents
I am going to analyse, I've chosen the top 100 books over the last 30
days (as of 3/27/2012) from <a href="http://www.gutenberg.org/browse/scores/top">Project
Gutenberg</a>. (Incidentally, Vatsayana's <i>Kamasutra</i><span style="font-style: normal;">
is a perennial number 3, and this accords with Amartya Sen's footnote
(pg. 25, </span><i>The Argumentative Indian</i><span style="font-style: normal;">):
</span><i>In fairness to Western expertise on India, it must be
conceded that there has never been any lack of Occidental interest in
what may be called the 'carnal sciences', led by the irrepressible
</i><span style="font-style: normal;">Kamasutra</span><i> and
</i><span style="font-style: normal;">Anangaranga</span><i>.</i><span style="font-style: normal;">)
I</span> looked at the source code for the page, copied it, and did
regular expression searches for the serial numbers of the books, then
wrote Python code to suck them from the website, keeping my fingers
crossed that they wouldn't think I = Robot and block my IP address.
Some of the .txt files turned out to be empty. I chopped off the
beginning (Project Gutenberg info.) and the last bit (license info.)
for each document.</div>
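The chopping step can be sketched as follows; the `*** START OF ... ***` / `*** END OF ... ***` marker patterns are an assumption about how the Project Gutenberg .txt files delimit their boilerplate, so they may need adjusting for individual books:

```python
import re

def strip_gutenberg_boilerplate(text):
    """Keep only the body between the START and END markers, if present.
    The marker patterns are assumptions; adjust them per file."""
    start = re.search(r"\*\*\* ?START OF .*?\*\*\*", text)
    end = re.search(r"\*\*\* ?END OF .*?\*\*\*", text)
    lo = start.end() if start else 0
    hi = end.start() if end else len(text)
    return text[lo:hi].strip()

sample = ("Project Gutenberg header...\n"
          "*** START OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n"
          "Call me Ishmael.\n"
          "*** END OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n"
          "License info...")
print(strip_gutenberg_boilerplate(sample))  # Call me Ishmael.
```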
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
As a reminder, the steps for processing
text are:</div>
<ol>
<li><div style="margin-bottom: 0in;">
Lowercase, depunctuate and stem
the words in the text of each document.</div>
</li>
<li><div style="margin-bottom: 0in;">
Construct a word
histogram for each document, for all the tokens. For example, for eBook3600:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8TtO3eiy_t1ihAWTaBuTKo1ypN_97B_HQ9WEIF_A-oCE3t4K9ropO_qW7r9x5V6UHio-ojI2c-SmhGtP5J_crl_kX0U7kjJUKDu9zkiDpNN5r6uiKvscdUK7R59fHT0x3TT-DfOk8O05v/s1600/_eBook3600StemZipf.jpeg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8TtO3eiy_t1ihAWTaBuTKo1ypN_97B_HQ9WEIF_A-oCE3t4K9ropO_qW7r9x5V6UHio-ojI2c-SmhGtP5J_crl_kX0U7kjJUKDu9zkiDpNN5r6uiKvscdUK7R59fHT0x3TT-DfOk8O05v/s1600/_eBook3600StemZipf.jpeg" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Zipf plot for eBook3600</td></tr>
</tbody></table>
</div>
</li>
<li><div style="margin-bottom: 0in;">
Construct the histogram for the
corpus: for each token, add the occurrence frequencies in all the
documents in the corpus.</div>
</li>
<li><div style="margin-bottom: 0in;">
Identify the rare tokens – those
that occur a total of only one time in the entire corpus. Why remove
any low-frequency tokens at all? There are a large number of
non-standard-English letters (possibly some of the documents are in
languages other than English or there may be a few words from other
languages or simple mitsakes) and non-ASCII and non-UTF-8
punctuation marks (for example Sanskrit diacritical notation from
the book ranked number 3 above) that my algorithm doesn't pick up.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijQ0vBvgdIDccwVQicMyW5MqeYb5Q8SOT1HtLC_hpK9k9-pcjX4g-Pnx0ozG3BbYZPP27I2VbSOdyARfTL77lzjUFBrH-v1D-SIHwVjrEJydxzQFoRL_4kXOKxaEFqpXmbt40tciS7Xh5o/s1600/_CorpusBot20_1.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijQ0vBvgdIDccwVQicMyW5MqeYb5Q8SOT1HtLC_hpK9k9-pcjX4g-Pnx0ozG3BbYZPP27I2VbSOdyARfTL77lzjUFBrH-v1D-SIHwVjrEJydxzQFoRL_4kXOKxaEFqpXmbt40tciS7Xh5o/s1600/_CorpusBot20_1.tiff" /></a></div>
</div>
<div style="margin-bottom: 0in;">
I thought that restricting the allowed
atoms to lowercase unaccented Latin letters would rule out too many
accented letters, and played with removing small frequency words. It
turns out that even for a 100 documents, ruling out the unipresent
tokens is enough since the remaining bipresent tokens seem quite
reasonable.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYmcUgHLmEO2ZuuD-FZWGXppb9nEQBFBMleZZ1aIUqmXX2K4ue_D491tpwlswF9OC5O8F7OdGSXoOULtcBfxKdKPZ75fXPGR5k_AU5ibQVgU0KPhHQqEGoUmlO8uvbj13MEf3hXCwzRw2N/s1600/_Corpus2Bott20_1.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYmcUgHLmEO2ZuuD-FZWGXppb9nEQBFBMleZZ1aIUqmXX2K4ue_D491tpwlswF9OC5O8F7OdGSXoOULtcBfxKdKPZ75fXPGR5k_AU5ibQVgU0KPhHQqEGoUmlO8uvbj13MEf3hXCwzRw2N/s1600/_Corpus2Bott20_1.tiff" /></a></div>
</div>
</li>
<li><div style="margin-bottom: 0in;">
Identify the small tokens, that
is, the single character ones except “a” and “i”, and put
both the small and the rare tokens in a list of raw words to be
removed. The few small tokens can be removed from all the documents
in the corpus; but since the rarity of a token can change as
documents are added to the corpus, the rare tokens should be left in
place, even though the number of rare tokens will increase almost
linearly with the number of words in the corpus, whereas the number
of higher-frequency words will increase much more slowly.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFQf-sIiDLUtZ9B4j53rc50l7lY3ANl1xOu75KI38a8Hw-xhNzTKcEHcj9IHxDlteBUaFstW8WaTkTFtTrrWDGTluTn61a3elsdyhXsW1C2vS6OHqHe26eXUhcKyLrPU8gDSUqXNWKl_lx/s1600/_CorpusSumm.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFQf-sIiDLUtZ9B4j53rc50l7lY3ANl1xOu75KI38a8Hw-xhNzTKcEHcj9IHxDlteBUaFstW8WaTkTFtTrrWDGTluTn61a3elsdyhXsW1C2vS6OHqHe26eXUhcKyLrPU8gDSUqXNWKl_lx/s1600/_CorpusSumm.tiff" /></a></div>
</div>
<div style="margin-bottom: 0in;">
From which you can see<br />
<span style="font-style: normal;"></span><br />
<ul>
<li>The number of unipresent tokens is
about half of <i>all</i><span style="font-style: normal;"> tokens,
but they represent only a marginal fraction of the words.</span></li>
<li><span style="font-style: normal;">There
are a surprisingly large number (132) of unidentified atoms, which
on average occur multiple times.</span></li>
<li><span style="font-style: normal;">If
you think that English is a gender-neutral language, <a href="http://pol-imer.blogspot.com/2012/03/is-english-gender-neutral.html" target="_blank">think again</a>.</span></li>
</ul>
</div>
</li>
</ol>
<ol start="6">
<li><div style="margin-bottom: 0in;">
Zipf plot of the entire corpus,<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjs97sT06A1LPiDrI8zWRSxEstVSfPlQ4qZoOKtwvtADGkSPO4U-1XZGUgwkIPSnN-GSerUIQo0DDWaVjEKBOn9wqbB3FhuYWZauyu13S_1SzmdLq_tCxglnw0o-pszXrmBkZb_TV87HqYX/s1600/_CorpusZipf.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjs97sT06A1LPiDrI8zWRSxEstVSfPlQ4qZoOKtwvtADGkSPO4U-1XZGUgwkIPSnN-GSerUIQo0DDWaVjEKBOn9wqbB3FhuYWZauyu13S_1SzmdLq_tCxglnw0o-pszXrmBkZb_TV87HqYX/s1600/_CorpusZipf.jpeg" /></a></div>
</div>
<div style="margin-bottom: 0in;">
which should be <a href="http://tatetech.blogspot.com/2012/03/stopandrarewordstobeornottobezipf.html" target="_blank">used to determine</a> the
minimum rank to exclude “stop” words (deviations from flatness
at the high-frequency end). Let's take a look at the high-frequency
end in more detail:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi1ALKtHZTCRfPH1tctYiwJZumw-WWAkkIuZZdqvgO2jcz_7Fv3CsFqO55pYJNk3HHwIVLAfqtVL_O4M_Ug_o_GooM9MIW1-2FlFgz64XvniG89HhOVv-6ZVeE-l61hLl7D2uzfEe_FMh0E/s1600/_CorpusZipHigh.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="427" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi1ALKtHZTCRfPH1tctYiwJZumw-WWAkkIuZZdqvgO2jcz_7Fv3CsFqO55pYJNk3HHwIVLAfqtVL_O4M_Ug_o_GooM9MIW1-2FlFgz64XvniG89HhOVv-6ZVeE-l61hLl7D2uzfEe_FMh0E/s640/_CorpusZipHigh.tiff" width="640" /></a></div>
</div>
<div style="margin-bottom: 0in;">
from which I can see choosing a stop
rank anywhere between 50 and 150, but I don't see at all why the
stop rank should be between 450 and 500 words, which is what
standard stop-word lists consist of. Ignorance => opportunity to
learn. However, for now I am going to go with the list of the
top-ranked 103 as the stopword list, which reverentially saves "god" and "himself", which come in at #124 and #125.</div>
</li>
<li><div style="margin-bottom: 0in;">
Add the set of stop words to the
set of words to be removed. The complement of this subset in the set
of all tokens in the corpus is the set of tokens to be used for the
statistical calculations. The truncations of the histograms to this
“use-set” of words are what we will use for studying Cosine
Similarity, its alternatives, and what should really be done with
the histograms.
</div>
</li>
</ol>
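Steps 3 and 4 above, summing the per-document histograms into a corpus histogram and flagging the unipresent tokens, can be sketched as follows (the two toy documents are placeholders):

```python
from collections import Counter

def corpus_histogram(doc_histograms):
    """Step 3: add each token's occurrence frequencies across all documents."""
    total = Counter()
    for hist in doc_histograms:
        total.update(hist)
    return total

def unipresent_tokens(corpus):
    """Step 4: tokens occurring exactly once in the entire corpus."""
    return {t for t, n in corpus.items() if n == 1}

docs = [Counter(whale=4, sea=2), Counter(sea=1, harpoon=1)]
corpus = corpus_histogram(docs)
print(unipresent_tokens(corpus))  # {'harpoon'}
```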
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
So in the next post, I'll lay out my
arguments comparing the ubiquitous Cosine Similarity to some simple
trigonometric alternatives, with some simple illustrative examples.
The post after that will consider document similarity within the
above corpus using both Cosine and my proposed alternative.</div>
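For reference before that comparison: the ubiquitous Cosine Similarity treats each word histogram as a vector and takes the inner product divided by the product of the norms. A minimal sketch:

```python
import math
from collections import Counter

def cosine_similarity(h1, h2):
    """cos(theta) between two word histograms viewed as vectors."""
    dot = sum(h1[t] * h2[t] for t in h1.keys() & h2.keys())
    norm1 = math.sqrt(sum(f * f for f in h1.values()))
    norm2 = math.sqrt(sum(f * f for f in h2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

a = Counter("to be or not to be".split())
b = Counter("to be".split())
print(round(cosine_similarity(a, b), 4))  # 0.8944
```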
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
Next what I'll propose is to go beyond
this approach to similarity, to really situate ourselves well in
word-vector space and see what really good measures we can construct
there, and of course, following all that theory, apply those ideas to
the documents in the corpus. Yum-yum!</div>
<div style="margin-bottom: 0in;">
<br /></div>
<br />ClimberT8http://www.blogger.com/profile/12302281220153714545noreply@blogger.com0tag:blogger.com,1999:blog-6519535535434149951.post-46232750399187733242012-03-27T00:47:00.000-07:002012-03-30T12:55:29.390-07:00Guide to Posts on text AnalysisThe posts make most logical sense in the following order:<br />
<br />
<a href="http://tatetech.blogspot.com/2012/03/sampletextforanalysis.html" target="_blank">NTA0: Sample Text for Analysis</a><br />
<br />
<a href="http://tatetech.blogspot.com/2012/03/numericaltextanalysispreliminaries.html" target="_blank">NTA1: Preliminaries</a><br />
<br />
<a href="http://tatetech.blogspot.com/2012/03/numericaltextanalysisorhamletquestion_26.html" target="_blank">NTA2: Text Analysis Illustrative Example</a><br />
<br />
<a href="http://tatetech.blogspot.com/2012/03/stopandrarewordstobeornottobezipf.html" target="_blank">NTA3: Selecting Stop Tokens</a><br />
<br />
<a href="http://tatetech.blogspot.com/2012/03/wordhistogramsorhamletquestion1.html" target="_blank">NTA4: Word Histograms</a><br />
<br />
<a href="http://tatetech.blogspot.com/2012/03/wordvectorspaceplay.html" target="_blank">NTA5: WordPlay</a><br />
<br />
<a href="http://tatetech.blogspot.com/2012/03/constructing-histograms-for-corpus.html" target="_blank">NTA6: Word Histogram for the Corpus</a><br />
<br />
<a href="http://tatetech.blogspot.com/2012/03/nta7dimensional-sparseness-or-what.html" target="_blank">NTA7: What tokens to use, dimensional Sparseness</a><br />
<br />ClimberT8http://www.blogger.com/profile/12302281220153714545noreply@blogger.com2tag:blogger.com,1999:blog-6519535535434149951.post-57057327701125882772012-03-26T13:05:00.002-07:002012-03-29T11:28:40.360-07:00NTA5: WordPlayNumerical Text Analysis 5<br />
<br />
<a href="http://tatetech.blogspot.com/2012/03/guide-to-posts-on-text-analysis.html" target="_blank">Guide to All Text Analysis Posts</a><br />
<br />
Prev: <a href="http://tatetech.blogspot.com/2012/03/wordhistogramsorhamletquestion1.html" target="_blank">Constructing Word Histograms</a><br />
<br />
This is not serious stuff, so indulge
me here. I am just playing with a cute idea which probably has zipf
practical applications.<br />
Read this after reading about the <a href="http://tatetech.blogspot.com/2012/03/wordhistogramsorhamletquestion1.html" target="_blank">construction of word histogram space</a>. <br />
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
Since I'll be analysing text, and
writing text to do so, let me distinguish text to be analysed by
putting it on a gray background. For example, the result of the
action of the lower-casing operator on some text can be shown as</div>
<div style="margin-bottom: 0in;">
<b>low</b><span style="font-weight: normal;">:</span></div>
<div style="background: #808080; margin-bottom: 0in;">
<span style="font-weight: normal;">ThiS
is tHe TeXt fOR makiNG LowER case</span></div>
<div style="margin-bottom: 0in;">
=</div>
<div style="background: #808080; margin-bottom: 0in;">
this is the text
for making lower case</div>
<div style="background: transparent; margin-bottom: 0in;">
<br /></div>
<div style="background: transparent; margin-bottom: 0in;">
Before
proceeding I just want to point out that it is not really the <i>text</i><span style="font-style: normal;">
I am interested in so much as the histogram of the text. So </span>
</div>
<div style="background: #808080; margin-bottom: 0in;">
<span style="font-style: normal;">LOWER
case </span>
</div>
<div style="background: transparent; margin-bottom: 0in;">
<span style="font-style: normal;">and
</span>
</div>
<div style="background: #808080; margin-bottom: 0in;">
<span style="font-style: normal;">case
LOWER </span>
</div>
<div style="background: transparent; margin-bottom: 0in;">
<span style="font-style: normal;">are
the same as vectors or word-histograms, as are:</span></div>
<div style="background: #808080; margin-bottom: 0in;">
<span style="font-style: normal;">text
text text</span></div>
<div style="background: transparent; margin-bottom: 0in;">
<span style="font-style: normal;">=</span></div>
<div style="background: transparent; margin-bottom: 0in;">
<span style="font-style: normal;">3
*</span></div>
<div style="background: #808080; margin-bottom: 0in;">
<span style="font-style: normal;">text
</span>
</div>
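The order-insensitivity and scalar behaviour described above can be sketched with Python's `collections.Counter` as a minimal stand-in for the word-histogram map (the `histogram` helper is mine, not code from the post):

```python
from collections import Counter

def histogram(text):
    """Word histogram: lower-case, split on whitespace, count tokens.
    Word order is discarded, so only the counts survive."""
    return Counter(text.lower().split())

# 'LOWER case' and 'case LOWER' have identical histograms:
assert histogram("LOWER case") == histogram("case LOWER")

# and repetition acts like scalar multiplication of the 'text' component:
assert histogram("text text text")["text"] == 3 * histogram("text")["text"]
```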
<div style="background: transparent; margin-bottom: 0in;">
<br /></div>
<div style="background: transparent; margin-bottom: 0in;">
<span style="font-style: normal;"><i>Eigen Wha?</i>: Just to refresh our memories, a linear operator on a vector space acts on any vector and gives you back another vector in the space. If the result of the action of the operator on a specific vector is to return that same vector multiplied by a scalar, that vector is called an <b>eigenvector</b> of the operator and the corresponding scalar is an <b>eigenvalue</b> of the operator. The determinant of the operator is the product of its eigenvalues. The eigenvalues are solutions to det(Operator - λ Identity) = 0.</span><br />
<br />
<span style="font-style: normal;">An</span>
eigenvector of the <b>low</b><span style="font-weight: normal;">
operator with eigenvalue 1 is</span></div>
<div style="background: #808080; margin-bottom: 0in;">
<span style="font-weight: normal;">upper</span></div>
<div style="background: transparent; margin-bottom: 0in;">
<span style="font-weight: normal;">,
since</span></div>
<div style="background: transparent; margin-bottom: 0in;">
<b>low</b><span style="font-weight: normal;">:</span></div>
<div style="background: #808080; margin-bottom: 0in;">
<span style="font-weight: normal;">upper</span></div>
<div style="background: transparent; margin-bottom: 0in;">
<span style="font-weight: normal;">=
1*</span></div>
<div style="background: #808080; margin-bottom: 0in;">
<span style="font-weight: normal;">upper</span></div>
<div style="background: transparent; margin-bottom: 0in;">
<br /></div>
<div style="background: transparent; margin-bottom: 0in;">
<span style="font-weight: normal;">Let's
just look at the </span><b>low</b><span style="font-weight: normal;">
operator in a bit more detail. If I restrict my word space to only
'a', both lower and upper case, then it is two dimensional. Let me
represent text in this space by the ordered pair of the number of
occurrences of 'a' and 'A', (</span><i><span style="font-weight: normal;">x,
y</span></i><span style="font-style: normal;"><span style="font-weight: normal;">).
So the text </span></span>
</div>
<div style="background: #808080; margin-bottom: 0in;">
<span style="font-style: normal;"><span style="font-weight: normal;">a
A A a a </span></span>
</div>
<div style="background: transparent; margin-bottom: 0in;">
<span style="font-style: normal;"><span style="font-weight: normal;">is
represented by (</span></span><i><span style="font-weight: normal;">3,2</span></i><span style="font-style: normal;"><span style="font-weight: normal;">).
Now the action of the lowering operator is </span></span><span style="font-style: normal;"><b>low</b></span><span style="font-style: normal;"><span style="font-weight: normal;">(</span></span><i><span style="font-weight: normal;">x,y</span></i><span style="font-style: normal;"><span style="font-weight: normal;">)
= (</span></span><i><span style="font-weight: normal;">x+y,0</span></i><span style="font-style: normal;"><span style="font-weight: normal;">),
and the eigenvector with eigenvalue 1 is (</span></span><i><span style="font-weight: normal;">1,0</span></i><span style="font-style: normal;"><span style="font-weight: normal;">).
However, if you calculate the determinant of </span></span><span style="font-style: normal;"><b>low</b></span><span style="font-style: normal;"><span style="font-weight: normal;">,
it turns out to be 0! So we know that it has an eigenvector with
eigenvalue 0. In the above representation, this is (</span></span><i><span style="font-weight: normal;">1,-1</span></i><span style="font-style: normal;"><span style="font-weight: normal;">).
But what would '-1' occurrences mean? We </span></span><i><span style="font-weight: normal;">know</span></i><span style="font-style: normal;"><span style="font-weight: normal;">
that a word can't occur -1 times in a text! </span></span>
</div>
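A minimal sketch of this calculation, with a hypothetical helper `low` standing in for the lowering operator on the two-dimensional ('a', 'A') space:

```python
def low(v):
    """The lowering operator on (x, y) = (count of 'a', count of 'A'):
    low(x, y) = (x + y, 0)."""
    x, y = v
    return (x + y, 0)

# 'a A A a a' is (3, 2) and lowers to 'a a a a a' = (5, 0):
assert low((3, 2)) == (5, 0)

# (1, 0) is an eigenvector with eigenvalue 1 ...
assert low((1, 0)) == (1, 0)

# ... and since det(low) = 1*0 - 1*0 = 0, there is an eigenvector with
# eigenvalue 0, namely (1, -1):
assert low((1, -1)) == (0, 0)
```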
<div style="background: transparent; margin-bottom: 0in;">
<br /></div>
<div style="background: transparent; margin-bottom: 0in;">
<span style="font-style: normal;"><span style="font-weight: normal;">Or,
do we?</span></span></div>
<div style="background: #808080; margin-bottom: 0in;">
<span style="font-style: normal;"><span style="font-weight: normal;">My
</span></span><span style="color: white;"><span style="font-style: normal;"><span style="font-weight: normal;">word</span></span></span><span style="font-style: normal;"><span style="font-weight: normal;">
but it can </span></span>
</div>
<div style="background: transparent; margin-bottom: 0in;">
<br /></div>
<div style="background: transparent; margin-bottom: 0in;">
<span style="font-weight: normal;">So
what are the rules? Well there is only one:</span></div>
<div style="background: #808080; margin-bottom: 0in;">
<span style="font-weight: normal;">word
</span>
</div>
<div style="background: transparent; margin-bottom: 0in;">
<span style="font-weight: normal;">+</span></div>
<div style="background: #808080; margin-bottom: 0in;">
<span style="color: white;"><span style="font-weight: normal;">word</span></span><span style="font-weight: normal;">
</span>
</div>
<div style="background: transparent; margin-bottom: 0in;">
<span style="font-weight: normal;">=</span></div>
<div style="background: #808080; margin-bottom: 0in;">
<span style="font-weight: normal;">word
</span><span style="color: white;"><span style="font-weight: normal;">word</span></span><span style="font-weight: normal;">
</span>
</div>
<div style="background: transparent; margin-bottom: 0in;">
<span style="font-weight: normal;">=</span></div>
<div style="background: #808080; margin-bottom: 0in;">
<br /></div>
<div style="background: transparent; margin-bottom: 0in;">
<br /></div>
<div style="background: transparent; margin-bottom: 0in;">
<span style="font-weight: normal;">So
the 0-eigenvalue eigenvector of </span><b>low</b><span style="font-weight: normal;">
is the following text!</span></div>
<div style="background: #808080; margin-bottom: 0in;">
<span style="font-weight: normal;">a
</span><span style="color: white;"><span style="font-weight: normal;">A</span></span></div>
<div style="background: transparent; margin-bottom: 0in;">
<br /></div>
<div style="background: transparent; margin-bottom: 0in;">
<span style="font-weight: normal;">When we <i>read</i> text like this, we are only sensitive to the magnitude. </span></div>
<div style="background: none repeat scroll 0% 0% transparent; margin-bottom: 0in;">
<br /></div>
<div style="background: none repeat scroll 0% 0% transparent; margin-bottom: 0in;">
<span style="font-weight: normal;">What
we've accomplished is that we have extended word histogram space
(linear space but only in the positive 2</span><sup><span style="font-weight: normal;">n</span></sup><span style="font-weight: normal;">-ant)
to be a genuine, full-fledged, authentic VECTOR space! </span>
</div>
<div style="background: transparent; margin-bottom: 0in;">
<br /></div>
<div style="background: transparent; margin-bottom: 0in;">
<span style="font-weight: normal;">This
raises the possibility –which, since I am a self-acknowledged
ignoramus in this field, I am sure has already been explored– of
using vector space and operator analysis to look at text. However, I
doubt that there is anything new there beyond what is already known
from standard statistical text analysis.</span></div>
<div style="background: transparent; margin-bottom: 0in;">
<br /></div>
<div style="background: transparent; margin-bottom: 0in;">
<span style="font-weight: normal;">In
any case, I hope this was as good for you as it was for me. </span><br />
<br />
<a href="http://tatetech.blogspot.com/2012/03/constructing-histograms-for-corpus.html" target="_blank"><span style="font-weight: normal;">Word Histograms for the Corpus </span></a>
</div>
<div style="background: transparent; margin-bottom: 0in;">
<br />
Next: Cosine Theta, Murdabad!</div>ClimberT8http://www.blogger.com/profile/12302281220153714545noreply@blogger.com0tag:blogger.com,1999:blog-6519535535434149951.post-55569515531910527292012-03-26T01:52:00.002-07:002012-03-29T11:28:10.371-07:00NTA3: StopAndRareWordsToBeOrNotToBeZipfExactly, that <i>is</i><span style="font-style: normal;">
the question!</span><br />
<br />
<span style="font-style: normal;">Numerical Text Analysis 3</span><br />
<br />
<a href="http://tatetech.blogspot.com/2012/03/guide-to-posts-on-text-analysis.html" target="_blank"><span style="font-style: normal;">Guide to All Text Analysis Posts</span></a><br />
<br />
<span style="font-style: normal;">Prev: <a href="http://tatetech.blogspot.com/2012/03/numericaltextanalysisorhamletquestion_26.html" target="_blank">Text Analysis with Illustrative Example</a></span>
<br />
<div style="font-style: normal; margin-bottom: 0in;">
<br /></div>
<span style="font-style: normal;">Let's stop and think about stop
words. What is the reason to remove them before doing numerical text
analysis? Surely the reason is not that they account for 50% of all
the words in a text (they do), and hence that removing them will
result in data savings. These are words like 'the', 'and', 'to' and
'of' that occur with very high frequency in the corpus of all texts
and whose relative frequencies across texts are approximately
constant. As such </span><span style="font-style: normal;"><b>stop
words do not serve to distinguish one text from another.</b></span><span style="font-style: normal;">
(I was about to say that they do not have any semantic content, but
I'll carefully ... step ... around ... that ... mine.) So we can now
choose to drop some number of the most frequent words. Note that this
decision should absolutely be corpus-based, not based on some
canonical list of stop words that some grad student came up with.
For example, the most frequent words in the corpus(Wm. Shakespeare's
works), corpus(Charles A. Dodgson's works), corpus(LinkedIn profiles)
or corpus(Match.com profiles) will likely be different.</span><br />
<br />
<br />
<br />
So if there isn't a canonical set (nor even <i>number</i>) of stop
words, how should one choose the cut-off? For this let's consider
another, more numerical reason to drop stop words.
<br />
<div style="margin-bottom: 0in;">
<span style="font-style: normal;">In
Numerical Text Analysis, there is a law that captures the general
distribution of word frequencies in texts. For any text or corpus,
rank all the word tokens by decreasing order of frequency, so 'the'
is ranked 1 (relative frequency is 0.05) and something like
'zitterbewegung' –with relative frequency of 10</span><sup><span style="font-style: normal;">-9</span></sup><span style="font-style: normal;">
even within physics texts– is ranked 250,000. <a href="http://en.wikipedia.org/wiki/Zipf%27s_law" target="_blank">Zipf's Law</a>
essentially states that the product of the rank and the frequency is
a constant function of rank. Mandelbrot –who got his start in
science doing numerical text analysis until he tried to analyse Jorge
Luis Borges' works and switched to chaos theory– made a
<a href="http://en.wikipedia.org/wiki/Zipf%E2%80%93Mandelbrot_law" target="_blank">modification to this law</a> that the exponent for frequency as a
function of rank is less than -1, and took into account deviations
near the ends. In any case, we then expect a plot of log</span><sub><span style="font-style: normal;">10</span></sub><span style="font-style: normal;">(frequency
* Rank) vs </span><span style="font-style: normal;">log</span><sub><span style="font-style: normal;">10</span></sub><span style="font-style: normal;">(Rank)</span><span style="font-style: normal;"> to approximate a straight line. So let's look at a
sample Zipf plot for Alice:</span></div>
<div style="margin-bottom: 0in;">
<br /></div>
<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjz9XghiNWEw8uz7L5ZeYx4JE4EgxIW-YNQAElmrE_CBV3nwdC4VffYsQqqo42zhQBe_YzytWbZu1CPgPUR1Vepa0fFAL7lReCRlQDr1_-cpcg5TcTcrheKoFXHtpBErT2V2ybYfiA2Imcb/s1600/_ZipfAliceOrig.jpeg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjz9XghiNWEw8uz7L5ZeYx4JE4EgxIW-YNQAElmrE_CBV3nwdC4VffYsQqqo42zhQBe_YzytWbZu1CPgPUR1Vepa0fFAL7lReCRlQDr1_-cpcg5TcTcrheKoFXHtpBErT2V2ybYfiA2Imcb/s320/_ZipfAliceOrig.jpeg" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><i>Alice in Wonderland</i> original text</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<br />
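For concreteness, here is a sketch of how the plotted quantity can be computed in Python (which the next post uses for its processing); the function name `zipf_points` and the toy text are mine, not code from the post:

```python
import math
from collections import Counter

def zipf_points(text):
    """Return (log10(rank), log10(frequency * rank)) pairs for each token,
    ranked by decreasing frequency. Under Zipf's law the second
    coordinate is roughly constant in the first."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    points = []
    for rank, (token, n) in enumerate(counts.most_common(), start=1):
        freq = n / total
        points.append((math.log10(rank), math.log10(freq * rank)))
    return points

# A perfectly Zipfian toy text: the i-th token occurs proportionally to 1/i,
# so frequency * rank is the same for every token and the plot is flat.
text = " ".join(["the"] * 12 + ["of"] * 6 + ["and"] * 4 + ["to"] * 3)
ys = [y for _, y in zipf_points(text)]
assert all(abs(y - ys[0]) < 1e-9 for y in ys)
```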
<div style="margin-bottom: 0in;">
Note that the way I am plotting the data is slightly different from Zipf's. Typically (meaning every instance I've seen so far) log(frequency) is plotted vs. log(rank), and one looks for a slope of -1. What that implies is that Frequency * Rank = constant, hence if I plot log(Frequency * Rank), data that satisfies Zipf's law should be flatlined, and deviations from Zipf's law will be zoomed up.<br />
<br />
"What's with the discontinuities?" The text has only 50K words and, after stemming, 2,000 tokens. There is a very large number of words (~700) that occur only once in the text. As we go up this (unsorted) subset, the frequency remains the same but the numerical rank keeps reducing, until we get to the top end and enter the subset of words which occur <i>twice</i>: at this point the rank only changes by 1, but the frequency doubles, adding 0.3 on the log10 scale. Note that this discontinuity is almost half the <i>y</i>-axis range! For a large corpus the number of genuine words that occur only a few times will shrink and their ranks will be pushed to larger numbers on a wider <i>x</i>-axis range, so even though the first discontinuity will still be log10(2), the plot will appear smoother.</div>
<div style="font-style: normal; margin-bottom: 0in;">
<br /></div>
<div style="font-style: normal; margin-bottom: 0in;">
We clearly see the
deviations from the expected power-law behaviour at low ranks, i.e.
for frequently occurring words. Roughly speaking, from the 100th token
onwards a fitted line has almost zero slope, whereas the more frequent
tokens deviate from that behaviour. If we wanted to be aggressive about
removing stop words, we would drop the 359 most frequent tokens; if we
wanted to be less aggressive, we would perhaps choose a cut-off rank of 99. See the <a href="http://tatetech.blogspot.com/2012/03/constructing-histograms-for-corpus.html" target="_blank">application of this idea</a> to a corpus.</div>
<div style="font-style: normal; margin-bottom: 0in;">
<br /></div>
<div style="font-style: normal; margin-bottom: 0in;">
For the full corpus
of all texts, we should also expect deviations from linearity at the
low frequency end of the plot, and this should be used to find a cut
off for the rare words.</div>
<div style="font-style: normal; margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
<span style="font-style: normal;"><b>Conclusion
for stop and rare words</b></span><span style="font-style: normal;"> </span>
</div>
<div style="font-style: normal; margin-bottom: 0in;">
<br /></div>
<div style="font-style: normal; margin-bottom: 0in;">
For the corpus of
texts under consideration:
</div>
<ol>
<li><div style="font-style: normal; margin-bottom: 0in;">
Do a Cummings
and stem the words in the texts
</div>
</li>
<li><div style="font-style: normal; margin-bottom: 0in;">
Count the
frequencies of occurrence in the corpus of all the word tokens, and
rank the word tokens by order of frequency.</div>
</li>
<li><div style="font-style: normal; margin-bottom: 0in;">
Plot
log(frequency * rank) vs. log(rank) of the word tokens – the Zipf Plot
for the corpus.
</div>
</li>
<li><div style="font-style: normal; margin-bottom: 0in;">
Look for
deviations from the expected straight line behaviour and choose a
cut-off rank at both ends.</div>
</li>
<li><div style="font-style: normal; margin-bottom: 0in;">
The set of
stop words comprises the word tokens ranked above (more frequent than)
the corresponding cut-off, and the rare words are those ranked below it
(less frequent).</div>
</li>
<li><div style="font-style: normal; margin-bottom: 0in;">
Remove these
sets of stop and rare words from all texts under consideration.</div>
</li>
<li><div style="font-style: normal; margin-bottom: 0in;">
The resulting
truncated histograms are the word-histograms for the texts that one
should then proceed to use for further numerical text analysis.<br />
<br />
Next: <a href="http://tatetech.blogspot.com/2012/03/wordhistogramsorhamletquestion1.html" target="_blank">Word Histograms </a></div>
</li>
</ol>
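The steps above can be sketched in Python, with hypothetical cut-off ranks standing in for ones read off a real corpus Zipf plot (stemming and the plotting step are omitted here):

```python
from collections import Counter

def truncate_histogram(corpus_counts, text_counts, stop_cutoff, rare_cutoff):
    """Drop a text's tokens that are ranked above stop_cutoff (too
    frequent in the corpus) or at/below rare_cutoff (too rare).
    The cut-offs are ranks chosen from the corpus Zipf plot; the
    numbers used below are invented for illustration."""
    ranked = [tok for tok, _ in corpus_counts.most_common()]
    keep = set(ranked[stop_cutoff:rare_cutoff])
    return Counter({t: n for t, n in text_counts.items() if t in keep})

corpus = Counter({"the": 100, "cat": 30, "hat": 20, "zitterbewegung": 1})
text = Counter({"the": 9, "cat": 3, "zitterbewegung": 1})

# 'the' is dropped as a stop word, 'zitterbewegung' as a rare word:
assert truncate_histogram(corpus, text, 1, 3) == Counter({"cat": 3})
```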
<div style="font-style: normal; margin-bottom: 0in;">
</div>ClimberT8http://www.blogger.com/profile/12302281220153714545noreply@blogger.com0tag:blogger.com,1999:blog-6519535535434149951.post-55209999556445959412012-03-26T01:49:00.001-07:002012-03-29T11:27:52.486-07:00NTA2: NumericalTextAnalysisOrHamletQuestionNumerical Text Analysis 2<br />
<a href="http://tatetech.blogspot.com/2012/03/guide-to-posts-on-text-analysis.html" target="_blank">Guide to all Text Analysis posts</a><br />
Prev: <a href="http://tatetech.blogspot.com/2012/03/numericaltextanalysispreliminaries.html" target="_blank">Preliminaries </a><br />
How the first line of Hamlet's
soliloquy at the beginning of Act 3 Scene 1 of the eponymous play
becomes {question: 1}.
<br />
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
Consider a text, either a web document,
an essay, a personal profile or a book, e.g. <a href="http://books.google.com/books/about/Alice_in_Wonderland.html?id=ID5P7xbmcO8C"><i>Alice
in Wonderland</i></a>. It is a sequence of words (woRds, wodrs, wd.s,
'words?' and 9) separated by whitespace. Before doing any numerical
work, we have to standardize the text using the text operations we
described in <a href="http://tatetech.blogspot.com/2012/03/numericaltextanalysispreliminaries.html" target="_blank">NumericalTextAnalysisPreliminaries</a>.
</div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
Here is what we know or what we should
follow:</div>
<div style="margin-bottom: 0in;">
Don't create more word tokens.</div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
<b>Lower</b><span style="font-weight: normal;">,
since it commutes with all other operators and can be applied to the
whole text, should be first.</span></div>
<div style="margin-bottom: 0in;">
<b>Stem</b><span style="font-weight: normal;">
needs to come after </span><b>DePunctuation</b><span style="font-weight: normal;">,
and since it operates on words, after </span><b>Parse</b><span style="font-weight: normal;">.
</span><b>Stop[rank]</b><span style="font-weight: normal;"> should be
used after all the rest of the cleaning-up, and since the rank is to
be determined based on word frequencies, should be after </span><b>Parse</b><span style="font-weight: normal;">
and </span><b>Histogram</b><span style="font-weight: normal;">.
</span><b>Histogram</b><span style="font-weight: normal;"> should come
after </span><b>Stem</b><span style="font-weight: normal;">. </span><b>Chop[n]</b><span style="font-weight: normal;">
is not needed. I don't know what to do about abbreviations and
mis-spellings, so I'll ignore those for now.</span></div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
Standardization</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<ol>
<li><div style="margin-bottom: 0in;">
<b>DePunctuate</b><span style="font-weight: normal;">
(All punctuation but '-' deleted, '-' replaced by single space.)</span><b>
</b> These first two steps can act on either text or lists.
</div>
</li>
<li><div style="margin-bottom: 0in;">
<b>Lower</b>. The first two
together are <b>cummings</b><span style="font-weight: normal;">.</span></div>
</li>
<li><div style="margin-bottom: 0in;">
<b>Parse (</b><span style="font-weight: normal;">on
whitespace)</span></div>
</li>
<li><div style="margin-bottom: 0in;">
<b>Stem </b><span style="font-weight: normal;">(Porter2
from <a href="http://pypi.python.org/pypi/stemming/1.0">stemming
package</a> in Python by Matt Chaput)</span></div>
</li>
<li><div style="margin-bottom: 0in;">
<b>Histogram</b><span style="font-weight: normal;">
Can act on either text or word list. </span>
</div>
</li>
<li><div style="margin-bottom: 0in;">
<b>Stop[rank]</b><span style="font-weight: normal;">
Look at the Zipf plots of the histogram for the whole corpus to
determine the rank of the word token below which words are to be
dropped.</span></div>
</li>
<li><div style="margin-bottom: 0in;">
<b>DeRarify</b><span style="font-weight: normal;">
</span>
</div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in; text-align: left;">
First let me show you the Zipf plots
for the text as we sequentially process it. Recall from the post on <a href="http://tatetech.blogspot.com/2012/03/stopandrarewordstobeornottobezipf.html" target="_blank">To Zipf or not to Zipf</a> that it is a plot of the log of the product of the frequency and rank for any token vs log(rank). We expect it to be linear, indicating a negative power law dependence of the frequency on the rank.</div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
All processing is done using Python code, which also creates data files and calls R for plotting. When I get my hands on real data, the R program can also fit the data to Zipf's law or Mandelbrot's law, which will help with identifying the deviations from expected linear behaviour and the ranks at which we establish the 'stop' and 'rare' cut-offs. (To see more details of the figures and tables, click on them.)<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjz9XghiNWEw8uz7L5ZeYx4JE4EgxIW-YNQAElmrE_CBV3nwdC4VffYsQqqo42zhQBe_YzytWbZu1CPgPUR1Vepa0fFAL7lReCRlQDr1_-cpcg5TcTcrheKoFXHtpBErT2V2ybYfiA2Imcb/s1600/_ZipfAliceOrig.jpeg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjz9XghiNWEw8uz7L5ZeYx4JE4EgxIW-YNQAElmrE_CBV3nwdC4VffYsQqqo42zhQBe_YzytWbZu1CPgPUR1Vepa0fFAL7lReCRlQDr1_-cpcg5TcTcrheKoFXHtpBErT2V2ybYfiA2Imcb/s1600/_ZipfAliceOrig.jpeg" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Zipf Plot of <i>Alice</i>, original text</td></tr>
</tbody></table>
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
From right to left, the different segments correspond to tokens which appear only once, then those that appear twice, thrice etc. The most frequent word 'the' sits at the bottom left corner. With a large corpus, we expect the curve to be smoothed out, especially in the middle. <br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiMWQwsaqosj2GKsRppsZXd3owJ8fPiFUhHQG81AoFGZx-byXPFyOYlp_c3uN6puomDz5XYOppSx2K_-nOe6-V5qaBrssUpJprjIY9de2EE-2LGRTNF3PE1SQ3Z4ekSeTkEsQBo1zkhk1KV/s1600/_ZipfAliceDePunc.jpeg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiMWQwsaqosj2GKsRppsZXd3owJ8fPiFUhHQG81AoFGZx-byXPFyOYlp_c3uN6puomDz5XYOppSx2K_-nOe6-V5qaBrssUpJprjIY9de2EE-2LGRTNF3PE1SQ3Z4ekSeTkEsQBo1zkhk1KV/s1600/_ZipfAliceDePunc.jpeg" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Zipf Plot of <i>Alice</i>, text with punctuation removed</td></tr>
</tbody></table>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhY5TL3tEcBt6fHy4777wCdpQO-_iU52U5U30PGoslIAPH-zRUCLYBtfeCQSItP2F-fIk8gFOII0ae844YxECr8bNNLIDKIRd1DbNA3AD6Eal2zsdUmyYITPrHkFTteYwyX0RWGH5zYrMGI/s1600/_ZipfAliceLow.jpeg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhY5TL3tEcBt6fHy4777wCdpQO-_iU52U5U30PGoslIAPH-zRUCLYBtfeCQSItP2F-fIk8gFOII0ae844YxECr8bNNLIDKIRd1DbNA3AD6Eal2zsdUmyYITPrHkFTteYwyX0RWGH5zYrMGI/s1600/_ZipfAliceLow.jpeg" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Zipf Plot of <i>Alice</i>, lowercased text</td></tr>
</tbody></table>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEivjFHUf1dE2cFF6NP7Bh_Wq0JwGvSWjHOAj2FCqL_YLEEzAB-bThygzXRWVMGX01JWN7GKnuQaS6GNARZcHIfK31l9j5Lc3fNRLyBsl-dViaJ_yg0F8t57yRgKW4LSj3_n-wO0wiM1wD4F/s1600/_ZipfAliceStem.jpeg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEivjFHUf1dE2cFF6NP7Bh_Wq0JwGvSWjHOAj2FCqL_YLEEzAB-bThygzXRWVMGX01JWN7GKnuQaS6GNARZcHIfK31l9j5Lc3fNRLyBsl-dViaJ_yg0F8t57yRgKW4LSj3_n-wO0wiM1wD4F/s1600/_ZipfAliceStem.jpeg" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Zipf Plot of <i>Alice</i>, histogram of stemmed tokens</td></tr>
</tbody></table>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUGnK6KmuCyhkPtb3Yhx4ANY19tIs19sbtJGMwZRlgVOaOTacxlDQ16LQwHu3IYOvTuJAtcGbAK9_t2cUku1QFNMc8T8PVRRuLGCdIrGGo_VDCU9aTZZwiGi1tzgodWXAjFfvzUUwuDzTk/s1600/_ZipfAliceStpd.jpeg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUGnK6KmuCyhkPtb3Yhx4ANY19tIs19sbtJGMwZRlgVOaOTacxlDQ16LQwHu3IYOvTuJAtcGbAK9_t2cUku1QFNMc8T8PVRRuLGCdIrGGo_VDCU9aTZZwiGi1tzgodWXAjFfvzUUwuDzTk/s1600/_ZipfAliceStpd.jpeg" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Zipf Plot of <i>Alice</i>, histogram with 100 stop words removed </td></tr>
</tbody></table>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
</li>
<li>These 100 'stop' words are just the 100 most frequent words in the text itself. As I have explained in the <a href="http://tatetech.blogspot.com/2012/03/stopandrarewordstobeornottobezipf.html" target="_blank">Zipf post</a>, the cut off should be based on the histogram of the entire corpus.<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div style="margin-bottom: 0in;">
Again, this is for illustration purposes only.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgNlgi6xwZvGvs7xeLZWag0W7uufNypyyJQ7_csdUWhi73sBDPE6OAVCO33i9GcNE8cjYu4Bv8hDJjKAhMNCSJP1DvgkT7JplKkGx8W_tfK1HwztcK7HZdIUzXkfnmqOKSzoH2I_XM27MUI/s1600/_ZipfAliceDeRr.jpeg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgNlgi6xwZvGvs7xeLZWag0W7uufNypyyJQ7_csdUWhi73sBDPE6OAVCO33i9GcNE8cjYu4Bv8hDJjKAhMNCSJP1DvgkT7JplKkGx8W_tfK1HwztcK7HZdIUzXkfnmqOKSzoH2I_XM27MUI/s1600/_ZipfAliceDeRr.jpeg" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Zipf Plot of <i>Alice</i>, histogram with 20 rare words removed </td></tr>
</tbody></table>
</div>
</li>
</ol>
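A minimal sketch of steps 1 through 5 of the standardization pipeline, with a toy suffix-stripper standing in for Porter2 (the post uses Matt Chaput's `stemming` package, which is not reproduced here):

```python
import re
from collections import Counter

def depunctuate(text):
    """Delete all punctuation except '-', which becomes a single space."""
    text = text.replace("-", " ")
    return re.sub(r"[^\w\s]", "", text)

def stem(word):
    """Toy stand-in for the Porter2 stemmer: strip a few common
    suffixes from sufficiently long words."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def standardize(text):
    """DePunctuate -> Lower -> Parse (on whitespace) -> Stem -> Histogram."""
    words = depunctuate(text).lower().split()
    return Counter(stem(w) for w in words)

# The first line of the soliloquy, on its way to {question: 1}:
h = standardize("To be, or not to be: that is the question!")
assert h["question"] == 1 and h["be"] == 2
```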
<div style="margin-bottom: 0in; text-indent: -0.25in;">
</div>
<ol start="7"><div style="margin-bottom: 0in;">
Now, let's take a look at the 20 top
and 20 bottom ranked tokens in Alice, and see how this list changes
with processing.</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6qQ5bw0VLjyEI4KgdkE8jVUU4g-uk-N1t0KJVP39b5kUhgdx8Rdqnld9JaXZ7GL9imG7oryXKACEdhA6AU2UAVMrWQNt6-6F20o0Hivgb4fVWBNYk_83ieY5PeXhiSzoxPWtt71QMw3Uw/s1600/AliceTop20.tiff" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="387" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6qQ5bw0VLjyEI4KgdkE8jVUU4g-uk-N1t0KJVP39b5kUhgdx8Rdqnld9JaXZ7GL9imG7oryXKACEdhA6AU2UAVMrWQNt6-6F20o0Hivgb4fVWBNYk_83ieY5PeXhiSzoxPWtt71QMw3Uw/s640/AliceTop20.tiff" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Top 20 Tokens</td></tr>
</tbody></table>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgizBihxwEMN1XcKrG-q0KsFaGV4TnWr5MqydiwD_chYom4i6Bxv7KvIX8B0HvikQMvNZS-KFw0OqyJaObPhp-dfRYiEkFJYm1O9f4ChjhTiEVj2DygTX-5tlELM6SwfK80JccAP2k-6RdV/s1600/AliceBot20.tiff" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="267" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgizBihxwEMN1XcKrG-q0KsFaGV4TnWr5MqydiwD_chYom4i6Bxv7KvIX8B0HvikQMvNZS-KFw0OqyJaObPhp-dfRYiEkFJYm1O9f4ChjhTiEVj2DygTX-5tlELM6SwfK80JccAP2k-6RdV/s640/AliceBot20.tiff" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Bottom 20 Tokens</td></tr>
</tbody></table>
<div style="margin-bottom: 0in;">
Summarizing,</div>
</ol>
<ol start="7"><div style="margin-bottom: 0in;">
The above steps suffice for spell-checked and edited works; the following steps need to be
implemented for user-provided text.</div>
<li><div style="margin-bottom: 0in;">
Expand any recognizable abbreviation to
its most likely form, so there is no remaining ambiguity. (I am not
sure about this step, but presumably it involves looking up some
standard dictionary.) From the remaining sequences of letters,
identify the non-words and check if there are any likely words in
the dictionary that could have been misspelled as the non-word. For
example, consider the non-word 'mnager'. Let's say that 'manager' is
likely to be misspelled as 'mnager' (dropped letter 'a') 5% of the
time (where the probabilities over all result words – spellings
and misspellings – of the source word 'manager' sum to 1), whereas
'manger' is likely to be misspelled as 'mnager' ('a' and 'n'
transposed dyslexically) 10% of the time. These are <i>not</i><span style="font-style: normal;">
the probabilities to consider and thence assign 'mnager' to
'manger'. Since 'manager' is a much more frequently used word than
is 'manger', 'mnager' is much more likely to be a misspelling of
'manager'. So we want to look at the probabilities that 'mnager'
results as a misspelling of a source word ('manager' or 'manger'),
where the sum over all source words equals 1, and then select the
most likely source. Hence 'mnager' would be assigned to 'manager',
unless one were to have additional information, e.g. that the text
is biblical in nature.</span></div>
</li>
<li><div style="margin-bottom: 0in;">
Toss out 'fzdurq' and other
non-words, so that all remaining sequences of letters are words.</div>
<div style="margin-bottom: 0in;">
<span style="font-style: normal;">These
last two complicated steps can be replaced by simply
removing the letter sequences that are least frequent in the
corpus, which are either terms from highly specialized jargon,
misspellings or abbreviations. </span>
</div>
</li>
</ol>
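The misspelling probabilities in the 'mnager' discussion above can be sketched as a tiny noisy-channel model. All the numbers below are illustrative stand-ins: the error probabilities are the ones quoted in the text, and the word frequencies are made up to reflect only that 'manager' is far more common than 'manger'.

```python
# Noisy-channel spelling correction, sketched with the 'mnager' example above.
# P(typo | word) values come from the text; the word frequencies are invented.

# P(observed typo | intended word)
p_typo_given_word = {
    ("mnager", "manager"): 0.05,  # dropped 'a'
    ("mnager", "manger"): 0.10,   # 'a' and 'n' transposed
}

# Relative corpus frequency of each candidate word (hypothetical numbers)
p_word = {"manager": 0.0005, "manger": 0.00001}

def correct(typo, candidates):
    """Pick the source word maximizing P(word | typo) ~ P(typo | word) * P(word)."""
    return max(candidates,
               key=lambda w: p_typo_given_word.get((typo, w), 0.0) * p_word[w])
```

Even though 'manger' is more easily garbled into 'mnager' in this toy error model, the much higher frequency of 'manager' makes it the more likely source, which is exactly the point of the argument above.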
<div align="LEFT" style="margin-bottom: 0in;">
So you see how “To be, or
not to be, that is the question:” is rendered as “question”. The
corpus of the text remains, ready for some numerical, but not
literary, analysis.
</div>
<div style="margin-bottom: 0in;">
<br />
Next: <a href="http://tatetech.blogspot.com/2012/03/stopandrarewordstobeornottobezipf.html" target="_blank">Selecting Stop Tokens</a></div>ClimberT8http://www.blogger.com/profile/12302281220153714545noreply@blogger.com0tag:blogger.com,1999:blog-6519535535434149951.post-19012011322535922662012-03-26T01:26:00.002-07:002012-03-29T11:27:31.057-07:00NTA1: NumericalTextAnalysisPreliminariesNumerical Text Analysis 1<br />
<a href="http://tatetech.blogspot.com/2012/03/guide-to-posts-on-text-analysis.html" target="_blank">Guide to all Text Analysis Posts</a><br />
<br />
Prev: <a href="http://tatetech.blogspot.com/2012/03/sampletextforanalysis.html" target="_blank">Sample Text </a><br />
<br />
Definitions and Text Operations:
<br />
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
Text atoms: Uppercase letters (<i>L</i>),
lowercase letters (<i>u</i>), punctuation marks (<i>?</i>), numerals
(<i>4</i><span style="font-style: normal;">).</span>
</div>
<div style="margin-bottom: 0in;">
Accent Marks: None in English; just
consider accented letters as independent letters?</div>
<div style="margin-bottom: 0in;">
Whitespace: space, tab, newline,
end-of-line, end-of-file.</div>
<div style="margin-bottom: 0in;">
Word: A sequence of text atoms; in
text, preceded and succeeded by whitespace.</div>
<div style="margin-bottom: 0in;">
Text: A sequence of text atoms and
whitespace. Equivalently, a sequence of words separated by
whitespace. Text space is a linear space.</div>
<div style="margin-bottom: 0in;">
Corpus: A collection of texts.</div>
<div style="margin-bottom: 0in;">
Word list: a list of words; order
usually does not matter. It is generated either from a text or from
another list.</div>
<div style="margin-bottom: 0in;">
Dictionary word: A word found in a
dictionary or in one of a set of dictionaries.</div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
<b>Parse</b><span style="font-weight: normal;">:
(By default on whitespace, but a parse map can be defined for any
sequence of words.) This is an operation that acts on </span><i><span style="font-weight: normal;">text</span></i><span style="font-weight: normal;">
and generates a </span><i><span style="font-weight: normal;">list</span></i><span style="font-weight: normal;">
of words, so </span><b><span style="font-style: normal;">parse</span></b><span style="font-weight: normal;">
is not in fact a text operation. It maps a point from one space into
another.</span></div>
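A minimal sketch of the default whitespace parse (no depunctuation or lowercasing yet; note that punctuation stays attached to its word):

```python
def parse(text):
    """Default parse: split a text on whitespace into a list of words."""
    return text.split()
```

As defined above, this maps a point of text space into list space, e.g. `parse("To be, or not")` yields `["To", "be,", "or", "not"]`.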
<div style="font-weight: normal; margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
Various operations can be carried out
on text (or on lists, since each element of a list is text); the ones
relevant for our purposes are:</div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
<b>Lower</b>: Acts on any piece of
text, converts all uppercase letters to the corresponding lowercase
ones.</div>
<div style="margin-bottom: 0in;">
<b>lower</b><span style="font-weight: normal;">(</span><i><span style="font-weight: normal;">upper's
and(?) LOWER</span></i><span style="font-weight: normal;">) = </span><i><span style="font-weight: normal;">upper's
and(?) lower</span></i></div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
<b>DePunctuate</b>: Almost all
punctuation marks in normal English text are juxtaposed with whitespace;
call these whitespace-juxtaposed punctuation marks (WJPMs). WJPMs can
either be deleted or replaced with whitespace; in either case one is
left with just whitespace. The two exceptions
are the hyphen '-' and the apostrophe “'”. The hyphen occurs in
hyphenated words like 'super-talented', which are composed of
independent words; hence the hyphen should be replaced by
whitespace. The apostrophe occurs in possessives and contractions,
e.g. “dog's”, “she'd” and “isn't”. In order not to create
single-letter words and peculiarities like “isn”, the apostrophe
should be deleted. Note that stemming leaves all three of
'isn', 'isnt' and 't' unchanged, and hence can't be used to make the choice. With
either choice for eliminating the WJPMs, we will have one exception,
either the hyphen or the apostrophe. For simplicity, let us delete
the WJPMs without replacement, delete the apostrophe without
replacement as well, and replace the hyphen with whitespace.
</div>
<div style="margin-bottom: 0in;">
<span style="font-style: normal;"><b>DePunctuate</b></span><span style="font-style: normal;"><span style="font-weight: normal;">(</span></span><i><span style="font-weight: normal;">“The
dog's bone was super-tasty! Wasn't it?” she asked.</span></i><span style="font-style: normal;"><span style="font-weight: normal;">)
= </span></span><i><span style="font-weight: normal;">The dogs bone
was super tasty Wasnt it she asked</span></i></div>
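The conventions chosen above (delete WJPMs, delete apostrophes, replace hyphens with whitespace) can be sketched with regular expressions; this is one possible implementation, not the only one:

```python
import re

def depunctuate(text):
    """DePunctuate with the conventions chosen above."""
    text = text.replace("-", " ")           # hyphens become whitespace
    text = re.sub(r"[’']", "", text)        # apostrophes deleted outright
    text = re.sub(r"[^\w\s]", "", text)     # remaining (whitespace-juxtaposed) marks deleted
    return " ".join(text.split())           # tidy up the leftover whitespace
```

Applied to the example sentence it reproduces the result above: 'dog's' becomes 'dogs', 'super-tasty' splits into two words, and 'Wasn't' becomes 'Wasnt'.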
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
<b>Chop[n]</b>: Eliminates words of
length <i>n</i><span style="font-style: normal;"> or less, with
possible exceptions like 'i' and 'a' for </span><i>n</i><span style="font-style: normal;">
= 1.</span></div>
<div style="margin-bottom: 0in;">
<span style="font-style: normal;"> </span><span style="font-style: normal;"><b>Chop[2]</b></span><span style="font-style: normal;"><span style="font-weight: normal;">(</span></span><i><span style="font-weight: normal;">To
be or not to be, that is the question.</span></i><span style="font-style: normal;"><span style="font-weight: normal;">)
= </span></span><i><span style="font-weight: normal;">not be, that the
question.</span></i><span style="font-style: normal;"> </span>
</div>
<div style="margin-bottom: 0in;">
I thought of this as a shortcut to <b>stop</b>. With the choice of <b>depunctuate</b> above, whereby punctuation marks (except for '-', which is replaced by whitespace) are simply deleted, there shouldn't be single-letter words other than 'i' and 'a'. Having rethought <b>stop</b>, <b>chop</b> became superfluous.</div>
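Superfluous or not, the <b>Chop[2]</b> example above can be checked with a one-line sketch. Note that a punctuation mark still attached to a word (as in 'be,') counts toward its length, which is why 'be,' survives while 'be' does not:

```python
def chop(text, n, exceptions=("i", "a")):
    """Chop[n]: eliminate words of length n or less, with possible exceptions."""
    return " ".join(w for w in text.split() if len(w) > n or w in exceptions)
```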
<div style="font-style: normal; margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
<span style="font-style: normal;"><b>Stem</b></span><span style="font-style: normal;"><span style="font-weight: normal;">:
Acts on a word and returns its stem. Conjugations of verbs, plurals,
adjectives, adverbs and other modifiers often have the same stem.
Note that the result of </span></span><span style="font-style: normal;"><b>stem</b></span><span style="font-style: normal;"><span style="font-weight: normal;">
on a regular English word is not necessarily a dictionary word – it
is only the sequence of letters identified as the stem. There are
many stemming algorithms. I've used Porter2 from the <a href="http://pypi.python.org/pypi/stemming/1.0">Python
implementation</a> by Matt Chaput.</span></span>
</div>
<div style="margin-bottom: 0in;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjzeBGAzxbGlXxCx7Ew2aifWSddfIl1aJqaJJmqinLjhvomURwY_Iz5TuDg7gtASZfZ4mihUgEnuCAtcZ0FPkvbvpILLgHJ9W6T1ZbqJ5Zj1x0tFdnxzTLOmmaqxgAgekN7ILYg8KdUj4sR/s1600/stemtest.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjzeBGAzxbGlXxCx7Ew2aifWSddfIl1aJqaJJmqinLjhvomURwY_Iz5TuDg7gtASZfZ4mihUgEnuCAtcZ0FPkvbvpILLgHJ9W6T1ZbqJ5Zj1x0tFdnxzTLOmmaqxgAgekN7ILYg8KdUj4sR/s320/stemtest.tiff" width="159" /></a></div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
I had misunderstood the purpose of
<b>stem</b><span style="font-weight: normal;">, and when I saw words
like </span><i><span style="font-weight: normal;">veget</span></i><span style="font-style: normal;"><span style="font-weight: normal;">
and </span></span><i><span style="font-weight: normal;">readabl</span></i><span style="font-style: normal;"><span style="font-weight: normal;">
I wrote </span></span><span style="font-style: normal;"><b>ifstem</b></span><span style="font-style: normal;"><span style="font-weight: normal;">
so it would only replace a word by its stem if the stem was a
dictionary word. </span></span>
</div>
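The <b>ifstem</b> idea can be sketched as a wrapper around any stemmer. The `toy_stem` function and the four-word dictionary below are hypothetical stand-ins for a real Porter2 implementation (such as the Python `stemming` package mentioned above) and a real word list:

```python
# Sketch of ifstem: replace a word by its stem only if the stem is itself a
# dictionary word. Both the dictionary and the stemmer are toy stand-ins.

dictionary = {"read", "reading", "readable", "vegetable"}  # toy word list

def toy_stem(word):
    # Stand-in for Porter2: crudely strip a few common suffixes.
    for suffix in ("able", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def ifstem(word, stem=toy_stem, words=dictionary):
    s = stem(word)
    return s if s in words else word
```

So 'reading' is replaced by its stem 'read', but 'vegetable' is left alone because its stem 'veget' is not a dictionary word, exactly the case that motivated <b>ifstem</b>.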
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
<b>Stop[rank]</b><span style="font-weight: normal;">:
Eliminates words whose tokens rank higher (are more frequent) than
</span><i><span style="font-weight: normal;">rank</span></i><span style="font-style: normal;"><span style="font-weight: normal;">.
See </span></span>
</div>
<div style="margin-bottom: 0in;">
the <a href="http://tatetech.blogspot.com/2012/03/stopandrarewordstobeornottobezipf.html" target="_blank">Selecting Stop Tokens</a> post.</div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
<b>Derarify[rank from bottom]</b>:
Eliminates words whose tokens are very low-ranking by frequency. <span style="font-style: normal;"><span style="font-weight: normal;">See
</span></span>
</div>
<div style="margin-bottom: 0in;">
the <a href="http://tatetech.blogspot.com/2012/03/stopandrarewordstobeornottobezipf.html" target="_blank">Selecting Stop Tokens</a> post.</div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
<b>SpellCheck</b><span style="font-weight: normal;">:
I have not thought about this yet. Eliminating the least frequent tokens
(DeRarifying) should take care of some misspellings.</span></div>
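<b>Stop</b> and <b>Derarify</b> can both be sketched from a single frequency count; the thresholds below (top 1 token stopped, tokens occurring once removed) are illustrative, not recommendations:

```python
from collections import Counter

def stop_and_derarify(words, stop_rank=10, rare_count=1):
    """Drop the stop_rank most frequent tokens (Stop) and any token
    occurring rare_count times or fewer (Derarify)."""
    counts = Counter(words)
    stops = {w for w, _ in counts.most_common(stop_rank)}
    return [w for w in words if w not in stops and counts[w] > rare_count]
```

For example, with `stop_rank=1` and `rare_count=1`, "the cat sat on the mat the cat ran" loses 'the' as a stop token and the singletons as rarities, leaving only the two occurrences of 'cat'.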
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
<b>ImProper</b><span style="font-weight: normal;">:
Identify and eliminate proper nouns. While any one given text may
have a few names that occur with high frequency (for example in </span><i><span style="font-weight: normal;">Alice
in Wonderland</span></i><span style="font-weight: normal;"> the name
Alice is the 10</span><sup><span style="font-weight: normal;">th</span></sup><span style="font-weight: normal;">
ranked word), over the whole corpus most proper nouns should be
relative rarities and can be eliminated by DeRarifying the text. The
most common proper nouns can perhaps be listed and eliminated
separately. </span>
</div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
<b>Histogram</b><span style="font-weight: normal;">:
Again, this is not a text operation; it acts on a wordlist
and returns a list of tuples or a dictionary (a list of key:value
pairs) whose first elements (or keys) are the members of the set
constructed from the elements of the list, and whose second elements (or
values) are the number of times each occurs in the list. The space of
word histograms is linear, has a natural origin and is non-negative;
it is a whole-number-valued “vector space”. So let's call these
things “word vectors”. For any given text, then, one can
construct its word vector.</span></div>
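In Python, <b>Histogram</b> is essentially the standard library's `collections.Counter`:

```python
from collections import Counter

def histogram(wordlist):
    """Histogram: map a word list to its word vector (token -> count)."""
    return Counter(wordlist)
```

For example, `histogram("to be or not to be".split())` gives the word vector `{to: 2, be: 2, or: 1, not: 1}`.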
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
The above operations don't commute with
each other, hence the order in which they are carried out will affect
the result.
</div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
Note for further thought: the above
operations are linear operators on a linear space, and they do have
eigenvalues and eigenvectors. But the operators are not all
symmetric, and some of their eigenvectors lie outside text space.
E.g. purely lowercase texts are eigenvectors of <b>lower</b><span style="font-weight: normal;">
with eigenvalue 1. A word vector consisting of lowercase text </span><i><span style="font-weight: normal;">low</span></i><span style="font-style: normal;"><span style="font-weight: normal;">
and</span></span><span style="font-weight: normal;"> equal and
opposite frequencies of occurrence of text whose lowercase is </span><i><span style="font-weight: normal;">low</span></i><span style="font-style: normal;"><span style="font-weight: normal;">
(e.g. </span></span><i><span style="font-weight: normal;">lOw</span></i><span style="font-style: normal;"><span style="font-weight: normal;">,
</span></span><i><span style="font-weight: normal;">LOW</span></i><span style="font-style: normal;"><span style="font-weight: normal;">)
is an eigenvector with eigenvalue 0. (See my future post on how to
extend text space to make some of these operators symmetric.)</span></span></div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
Next: <a href="http://tatetech.blogspot.com/2012/03/numericaltextanalysisorhamletquestion_26.html" target="_blank">Text Analysis with Example</a></div>ClimberT8http://www.blogger.com/profile/12302281220153714545noreply@blogger.com0tag:blogger.com,1999:blog-6519535535434149951.post-37009526422053840362012-03-26T00:25:00.001-07:002012-03-29T11:27:13.037-07:00NTA0: SampleTextForAnalysisNumerical Text Analysis 0<br />
<a href="http://tatetech.blogspot.com/2012/03/guide-to-posts-on-text-analysis.html" target="_blank">Guide to all Text Analysis posts </a><br />
<br />
Try to guess the text I am going to
use as an illustrative example. At a certain stage of
text-processing, it has somewhat fewer than 2,000 word-tokens and a
total length of a bit less than 27,000 words. I rank these 2,000 word-tokens
by frequency of appearance in the text. Here are some
of the salient tokens and their ranks in increasing order: scan the
tokens one at a time, and stop when you guess the text they are from.
Give yourself points corresponding to the rank of the word you stopped at,
and report back to me via a comment.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjAPUv4yFEi_zccnhemiyBDSlZLzeNABZOxoNM364x1zs-R3Ib33EU319X6EyMX826FK69gw9OzNEqOqcD50HfKn5Jxgc3Q7cyr_E5h6r-2X2ujKKD60ZNQosr_vl0IEHieFFP2BYQk9PDj/s1600/Ecila.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjAPUv4yFEi_zccnhemiyBDSlZLzeNABZOxoNM364x1zs-R3Ib33EU319X6EyMX826FK69gw9OzNEqOqcD50HfKn5Jxgc3Q7cyr_E5h6r-2X2ujKKD60ZNQosr_vl0IEHieFFP2BYQk9PDj/s320/Ecila.tiff" width="180" /></a></div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
Here is more information about the
words:</div>
<div style="margin-bottom: 0in;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjp9ythSDEpPuyNq7Mfl0T5qAZpzkCEY75IY4pnx4XxRYmU7YkIHD1MnZvpvKYAX0K78ErTXTxXIJtmfedbSWHIkCEXgsvV2xxzlBN2eNAMxaaHPVv6YYhjnNjZyPESV4SnoR-GX1GgWd9s/s1600/AliceWds.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjp9ythSDEpPuyNq7Mfl0T5qAZpzkCEY75IY4pnx4XxRYmU7YkIHD1MnZvpvKYAX0K78ErTXTxXIJtmfedbSWHIkCEXgsvV2xxzlBN2eNAMxaaHPVv6YYhjnNjZyPESV4SnoR-GX1GgWd9s/s320/AliceWds.tiff" width="319" /></a></div>
<br />
The word 'head' appears only 4/5 of the
times that the word 'queen' does. Assuming that 'head' appears half
the time in neutral circumstances, the Queen is only yelling “Off
with his head!” 2/5 of the times she makes an appearance.<br />
<br />
The full text can be found<a href="http://books.google.com/books/about/Alice_in_Wonderland.html?id=ID5P7xbmcO8C" target="_blank"> here</a>. <br />
<div style="margin-bottom: 0in;">
<br />
Next: <a href="http://tatetech.blogspot.com/2012/03/numericaltextanalysispreliminaries.html" target="_blank">Preliminaries</a></div>ClimberT8http://www.blogger.com/profile/12302281220153714545noreply@blogger.com1tag:blogger.com,1999:blog-6519535535434149951.post-23832538609869221232012-03-22T02:24:00.000-07:002012-03-29T11:28:25.022-07:00NTA4: WordHistogramsOrHamlet{question:1}Numerical Text Analysis 4<br />
<br />
<a href="http://tatetech.blogspot.com/2012/03/guide-to-posts-on-text-analysis.html" target="_blank">Guide to All text Analysis posts</a><br />
<br />
Prev: <a href="http://tatetech.blogspot.com/2012/03/stopandrarewordstobeornottobezipf.html" target="_blank">Selecting Stop Tokens</a><br />
<br />
WordHistogramsOrHamlet{question:1}<br />
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
Consider the space of word-vectors:
</div>
<ol start="0">
<li><div style="margin-bottom: 0in;">
If there are D different word
tokens in the entire collection of texts under consideration, then
word-vector space is D-dimensional.</div>
</li>
</ol>
<ol>
<li><div style="margin-bottom: 0in;">
It is a linear space, but it is
not exactly a vector space: though there is an origin or 0-vector
(null-texts), word-vectors do not have additive inverses that I can
interpret, so there is no subtraction defined on the space of
word-vectors. Hence word-vector space is the positive
1/2<sup>D</sup>-ant of D-dimensional space.</div>
</li>
</ol>
<ol start="2">
<li><div style="margin-bottom: 0in;">
Two texts are equivalent if they
have the same word histogram, i.e. the same distribution of the
frequencies of occurrence of the stemmed, non-stop word-tokens.
Word-vectors are then equivalence classes of texts. </div>
</li>
</ol>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
For example, of the following 3 texts,
Text A and B are equivalent to each other, but not to Text C.</div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
Text A: This “teXt” is equivalent
to the following text, Not the preceding one?</div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
Text B: The preceding text IS
equivalent to “this one”, but not to the following texts.</div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
Text C: “This text” is not the same
as the preceding one, and has no following text.</div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
After removing punctuation,
lowercasing, stemming and removing stop words, their histograms are
A: {text: 2, preced: 1, follow: 1, equival: 1}, </div>
<div style="margin-bottom: 0in;">
B: {text: 2, preced:
1, follow: 1, equival: 1} and </div>
<div style="margin-bottom: 0in;">
C: {text: 2, preced: 1, follow: 1,
equival: 0}.</div>
<div style="margin-bottom: 0in;">
<br /></div>
<ol start="3">
<li><div style="margin-bottom: 0in;">
Two word-vectors can be added to
each other, and they can be scaled by positive integers. If we
define the '+' of two texts to be the text formed by writing one
text followed by the other, then</div>
<div style="margin-bottom: 0in;">
vec(Text D) + vec(Text E) = vec(Text D
'+' Text E)</div>
</li>
<li><div style="margin-bottom: 0in;">
You can effectively <i>divide</i>
a word-vector V by a positive integer <i>q</i><span style="font-style: normal;">
(and hence scale by positive rationals) by leaving V alone and
scaling all the rest by </span><i>q</i><span style="font-style: normal;">.
</span>
</div>
</li>
</ol>
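Property 3 above can be checked directly with Counter-based word vectors (whitespace parse only, no other processing):

```python
from collections import Counter

def vec(text):
    """Word vector of a text: histogram of its whitespace-parsed words."""
    return Counter(text.split())

def concat(d, e):
    """The '+' of two texts: one text written after the other."""
    return d + " " + e
```

`Counter` supports `+`, so `vec(d) + vec(e) == vec(concat(d, e))` holds for any two texts `d` and `e`, which is exactly the additivity claimed above.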
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
If we normalize all word-vectors to
unit magnitude (i.e. as vectors, so the sum of the squares of the
frequencies is 1), then the resulting space is the positive
1/2<sup>D</sup>-ant of the unit (D-1)-sphere, and texts are now points on this piece of the sphere.</div>
<div style="margin-bottom: 0in;">
<br /></div>
However, we could also normalize the word-histograms, so that the sum of
the frequencies themselves – <i>not</i><span style="font-style: normal;">
the sum of their squares – </span>is 1. In this case the resulting
space is the (D-1)-simplex, which meets the unit sphere at its vertices,
the coordinate unit vectors. Note that the space of probabilities for the occurrence of D
mutually exclusive events is exactly this (D-1)-simplex, and that word-histograms
have a natural probabilistic interpretation but not an obvious
interpretation as vectors. So in most cases we will prefer to live on
the simplex rather than the sphere.<br />
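The two normalizations can be sketched side by side; the word vector below is the one computed for Texts A and B earlier:

```python
import math

def normalize_vector(hist):
    """Scale so the squared frequencies sum to 1: a point on the unit sphere."""
    norm = math.sqrt(sum(v * v for v in hist.values()))
    return {w: v / norm for w, v in hist.items()}

def normalize_simplex(hist):
    """Scale so the frequencies sum to 1: a point on the probability simplex."""
    total = sum(hist.values())
    return {w: v / total for w, v in hist.items()}
```

Both maps rescale a word vector along its ray from the origin, which is why the angles between texts as viewed from the origin are unchanged by either choice.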
<br />
What does text-space look like? I
want you to imagine yourself at the origin, at the point
corresponding to the null-text. Small texts with few words are close
by and large texts are further away. All 'normal' texts are in the
positive 2<sup>-D</sup>-ant (<span style="font-family: Times New Roman,serif;">π</span><span style="font-family: Times New Roman,serif;">/2
radians in 2D and </span><span style="font-family: Times New Roman,serif;">π/2</span><span style="font-family: Times New Roman,serif;">
steradians in 3D, etc.). Similar texts are along the same line of
sight and very dissimilar texts will be at large angles to each
other, up to a maximum of 90 degrees. Whether we've normalized the
texts as vectors (on the unit sphere) or as histograms (on the unit
simplex), the angles between them </span><span style="font-family: Times New Roman,serif;"><i>as
viewed from the origin</i></span><span style="font-family: Times New Roman,serif;"><span style="font-style: normal;">
are the same. </span></span><br />
<span style="font-family: Times New Roman,serif;"><span style="font-style: normal;"> </span></span>
<br />
<span style="font-family: Times New Roman,serif;"><span style="font-style: normal;">The word histogram for the first line of Hamlet's
soliloquy at the beginning of Act 3 Scene 1 of the eponymous play is
{question: 1}.</span></span>
<br />
<div style="margin-bottom: 0in;">
<br />
This still requires only fairly simple undergrad or even high school CS skills (judging by the really smart interns at LinkedIn during the summer of 2011), and perhaps a graduate numerical text analysis course.<br />
<br />
Next: An aside <a href="http://tatetech.blogspot.com/2012/03/wordvectorspaceplay.html" target="_blank">Word Play</a><br />
or see the <a href="http://tatetech.blogspot.com/2012/03/constructing-histograms-for-corpus.html" target="_blank">word histogram for the corpus</a><br />
<br /></div>ClimberT8http://www.blogger.com/profile/12302281220153714545noreply@blogger.com0tag:blogger.com,1999:blog-6519535535434149951.post-85594314359636538212012-01-23T16:14:00.001-08:002015-04-27T17:03:38.804-07:00Factorial calculator<title>Factorial Test - Javascript</title>
<script type="text/javascript">
// Recursive factorial; for n > 170 the result overflows to Infinity in IEEE doubles.
function factorial(f) { return ((f <= 1) ? 1 : f * factorial(f - 1)); }
function fact() {
var mana = document.calc_form.mana.value;
var answer = '';
if (mana !== '') {
answer = factorial(Number(mana)); // convert the input string to a number
}
document.calc_form.answer.value = answer;
return false;
}
</script>
<br />
<h1>
Factorial - JavaScript </h1>
<h1>
<span style="font-size: large;"><span style="font-weight: normal;">Little test bits as I build up to implementing the computational and analytical algorithms in JavaScript. This is the first dynamic webpage I've ever written!</span></span></h1>
<h1>
<span style="font-size: large;"><span style="font-weight: normal;">Please enter a natural number less than 171. Feel free to try a larger number, but don't trust answers for non-natural numbers: this is NOT the Gamma-function extension to the reals or complexes. </span></span></h1>
<form action="" name="calc_form" onsubmit="return fact();">
<label>Natural <input name="mana" /></label><br />
<input type="submit" value="Calculate" /><br />
<label>Factorial <input name="answer" value="" /></label><br />
<br />
Thank you Chris!
</form>
ClimberT8http://www.blogger.com/profile/12302281220153714545noreply@blogger.com0tag:blogger.com,1999:blog-6519535535434149951.post-83971008438342846812012-01-20T10:04:00.000-08:002012-01-23T17:19:45.068-08:00test<title>Mana Vita Will - Javascript</title>
<script type="text/javascript">
function calculate() {
var mana = document.calc_form.mana.value;
var vita = document.calc_form.vita.value;
var will = document.calc_form.will.value;
var answer = '';
if (mana !== '' && vita !== '' && will !== '') {
answer = (30 + (mana * 1.25) / 10 + (vita * 0.75) / 10) * (will / 10);
}
document.calc_form.answer.value = answer;
return false;
}
</script>
<br />
<h1>
Mana Vita Will - Javascript</h1>
<form action="" name="calc_form" onsubmit="return calculate();">
<label>Mana <input name="mana" /></label><br />
<label>Vita <input name="vita" /></label><br />
<label>Will <input name="will" /></label><br />
<input type="submit" value="Calculate" /><br />
<label>Answer <input name="answer" value="" /></label>
</form>ClimberT8http://www.blogger.com/profile/12302281220153714545noreply@blogger.com0tag:blogger.com,1999:blog-6519535535434149951.post-1670311755007512792011-09-09T22:21:00.000-07:002011-09-11T14:11:26.928-07:00Rate of FemINism amongst LinkedIn members?Have you recently joined LinkedIn or edited your profile lately? Below the lines for your first and last names, there is a line for "Former/Maiden Name:".<br />
<span style="font-size: large;"><u>The evidence</u></span><br />
<u> </u> <br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMLgZP5G6ffTMf_fl7IMrLNULQosS3q5pLEWJlKqeu_BXOBZegaVgEp4uKT6Wyi89qEwmwLqg-zS1vq2tG-7KNjzGsFC58u432RTzGmdlKKakWDf8-XSu_rp6DcXuCgwOtysbzyzLhyphenhyphen2kq/s1600/femINist.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMLgZP5G6ffTMf_fl7IMrLNULQosS3q5pLEWJlKqeu_BXOBZegaVgEp4uKT6Wyi89qEwmwLqg-zS1vq2tG-7KNjzGsFC58u432RTzGmdlKKakWDf8-XSu_rp6DcXuCgwOtysbzyzLhyphenhyphen2kq/s320/femINist.jpg" width="283" /></a></div><br />
The obvious question is, "How and when did it come to be there?".<br />
<br />
<span style="font-size: large;"><u>The question</u></span> The interesting meta-question, Ash, ish "Why has this so very 19th century concept persisted well into the 21st century, and what does it tell us about the LinkedIn membership?"<br />
<br />
<span style="font-size: large;"><u>The answer</u></span> Here's (Thank you Megrah!) my answer: It is because the rate of fem<span style="color: magenta;">IN</span>ism amongst LinkedIN members is very low, in fact about two per million or <b>one in 500,000</b> in the best case scenario.<br />
<br />
<span style="font-size: large;"><u>My assumptions</u></span>:<br />
1) Had they seen "Former/Maiden Name" <i>any</i> feminist worth their (Sorry, Megrah!) salt would object to that language ...<br />
2) ... and have taken action --the difference between feminism and namby-pamby "humanism" is <b>action</b>-- for example by bringing it to the attention of LinkedIn...<br />
3) ...and LinkedIn would have removed the sexist language.<br />
(As of 9 September 2011, that line was still there on the "edit profile" page.)<br />
4) Every former or current LinkedIn member has seen that line at least once, during registration.<br />
<br />
<span style="font-size: large;"><u>Worst case analysis:</u></span> There are currently upwards of 120 million LinkedIn members. Only one "took action" - walking a 100' and chatting. So in the worst case analysis, the rate of femINism is 1 in 120 million.<br />
<br />
But wait, LinkedIn hasn't always had 120 million members! And how long did it take that one person (myself) to act?<br />
<br />
That calls for a more sophisticated analysis, taking into account the duration of exposure of the sexist line to members and the time it took me from when I could have first noticed it till I acted.<br />
<br />
<span style="font-size: large;"><u>Let's do the math, Barbie!</u></span> Take LinkedIn's membership numbers: 4.5K in June 2003, increasing to 120M in August 2011. Assume 5) exponential growth, and calculate the time constant (in base 10 it is about 1/2 per year). Then simply integrate the membership over time and you get about 1 million man-years! Ohh ... OK ... people-years.<br />
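The time constant quoted above can be checked in a couple of lines; the 8.2-year span is my rounding of June 2003 to August 2011:

```python
import math

# LinkedIn membership: ~4.5K in June 2003, ~120M in August 2011 (~8.2 years apart).
n0, n1, years = 4.5e3, 120e6, 8.2

# Base-10 time constant k in N(t) = n0 * 10**(k * t)
k = math.log10(n1 / n0) / years
```

This gives k of roughly 0.54 per year, i.e. "about 1/2 per year" in base 10, as claimed.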
<br />
I've been a member of LinkedIn for about 2 years, and it took me that long to notice and take (minor) action - so that is 2 people-years in the numerator. And there you have it:<br />
<br />
<span style="color: magenta;"><b><span style="font-size: large;">Rate of FemINism amongst LinkedIn membership is at most 1 in 500,000!</span></b><span style="color: black;"><b><span style="font-size: large;"> </span></b> </span></span><br />
<br />
<span style="color: magenta;"><span style="color: black;"><span style="font-size: large;"><u>One possible quibble</u></span> and why it beautifully doesn't matter, because of exponential growth: What if that line was only introduced later, not at the very beginning, but say when membership was already 10 times as large as at the start? Wouldn't that mean the rate of feminism is actually much better, 2 per 100,000 (one tenth of a million), which is really 1 out of 50,000? That doesn't look too bad! Is the analysis so sensitive to assumptions about initial conditions?</span></span><br />
<span style="color: magenta;"><span style="color: black;"> </span></span><br />
<span style="color: magenta;"><span style="color: black;">Well, no. It takes LinkedIn only 2 years to increase its membership by a <i>factor of 10</i>. Over those early 2 years with exponentially fewer members, the loss in member exposure turns out to be only 135,000 people-years. So even with that ameliorative assumption, we would get the rate of femINism to be 2 per (1 million - 135,000), or about 1 per 430,000 as opposed to 1 per 500,000. <b>Does that really make you feel better?</b></span></span><br />
<br />
<span style="font-size: large;"><u>Full Disclosure</u></span>: I am an intern at LinkedIn (and proud of it) and a femINist (and proud of it).ClimberT8http://www.blogger.com/profile/12302281220153714545noreply@blogger.com1tag:blogger.com,1999:blog-6519535535434149951.post-8853183396840702752011-09-01T00:42:00.000-07:002011-09-01T00:42:02.730-07:00Cosine Similarity Murdabad! <br />
<div class="MsoNormal">COSINE SIMILARITY MURDABAD!<o:p></o:p></div><div class="MsoNormal"><br />
</div><div class="MsoNormal">On the use of the “Cosine Similarity” as a quantifier of the relatedness of elements of a linear space.<o:p></o:p></div><div class="MsoNormal"><br />
</div><div class="MsoNormal">The Cosine Similarity is one of the quantifiers used in numerical text analysis for judging the similarity between texts, as follows:<o:p></o:p></div><div class="MsoNormal">Consider a text, either a web document, an essay or a book. Perform an e e cummings on it, which is jargon for the following set of operations: remove all punctuation and other non-letter keyboard characters, lowercase everything and then parse it on whitespace. What you are left with is a collection of words. Now remove various small and frequent “stop” words like “a”, “the”, “is” etc., which account for about a third of Shakespeare’s vocabulary. Now count the number of occurrences of the distinct words (or “word-tokens”). <o:p></o:p></div><div class="MsoNormal"><br />
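As a concrete illustration of the pipeline just described, a sketch in Python (the stop-word list here is a tiny toy stand-in of my own, not a standard one):

```python
import re
from collections import Counter

STOP_WORDS = frozenset(["a", "an", "the", "is", "of", "to", "and"])  # toy list

def word_histogram(text):
    # the "e e cummings" treatment: keep letters only, lowercase,
    # parse on whitespace, drop stop words, count distinct word-tokens
    tokens = re.sub(r"[^A-Za-z]+", " ", text).lower().split()
    return Counter(t for t in tokens if t not in STOP_WORDS)

print(word_histogram("The cat sat on the mat; the cat slept."))
# Counter({'cat': 2, 'sat': 1, 'on': 1, 'mat': 1, 'slept': 1})
```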
</div><div class="MsoNormal">If the word tokens are ordered, the list of the corresponding number of occurrences in a text is the “word histogram” of the text. The space of word histograms is linear, has a natural origin and is non-negative; it is a whole-number-valued vector space. So let’s call these things “word vectors”.<o:p></o:p></div><div class="MsoNormal"><br />
</div><div class="MsoNormal">Any given text is represented by a point in this word vector space. In numerical text analysis (after TF-IDF weighting, if one so chooses) one wants a way to quantify the relatedness or similarity between two texts – each represented as a vector – and the simplest thing one can do with two vectors is … the scalar product! Which, as one knows, is the product of the magnitudes of the two vectors and the Cosine of the angle between them. Now as far as “similarity” is concerned, the magnitudes of the word vectors don’t matter (half of “Romeo and Juliet” is similar to the whole text).<o:p></o:p></div><div class="MsoNormal"><br />
</div><div class="MsoNormal">So we have the numerical “Cosine Similarity”: it is close to 1 for vectors in almost the same direction and 0 for perpendicular vectors. Since the space is non-negative, there are no pairs of anti-parallel vectors with Cosine Similarity = -1.<o:p></o:p></div><div class="MsoNormal"><br />
</div><div class="MsoNormal">Cosine similarity as a cardinal number: You can’t add it, you can’t subtract it, in fact you can’t do any of the numerical operations on it. It isn’t a distance function, nor are its additive or multiplicative inverses. From the point of view of trying to do any geometry on this space, it is quite useless. <o:p></o:p></div><div class="MsoNormal"><br />
</div><div class="MsoNormal">If what one is really trying to quantify is the angle between the two vectors, Cosine is a very poor substitute. Since its slope at 0<sup>o</sup> is zero, it discriminates very poorly between close neighbours, which is precisely the region of interest. Other absurdities are manifest: vectors 60 degrees apart from each other have a Cosine Similarity of 50%; vectors 30 degrees apart (a third of the way to perpendicular) have a Cosine Similarity of 86%! Bit vectors with only half their bits in common have a Cosine Similarity of 70%, but they also have a Sine Dissimilarity of 70%!<o:p></o:p></div><div class="MsoNormal"><br />
</div><div class="MsoNormal">Now it turns out that the ArcSine of the square-root of (1 minus the square of the Similarity) <i>is</i> a good metric distance. OK, so this is just the angle that Cosine is the Cosine of. So Cosine does have some value for calculating the angle between the vectors, except for that pesky vanishing slope, which causes an amplification of errors at small angles. More on this when we consider the ordinal virtues of Cosine.<o:p></o:p></div><div class="MsoNormal"><br />
</div><div class="MsoNormal">So one <i>should</i> use the angle: it is the metric distance on projective word vector space, and one can use it to do all kinds of geometrical stuff – cluster density is an example.<o:p></o:p></div><div class="MsoNormal">However, to calculate it there are better, less error-prone trigonometric means: whereas Cosine(theta) = <b>a</b> . <b>b</b> for unit vectors, <o:p></o:p></div><div class="MsoNormal">Sine(theta/2) = |<b>a</b> – <b>b</b>|/2 is monotonically increasing in theta and strictly monotonic except at theta = 180, which is in some sense the least interesting angle.<o:p></o:p></div><div class="MsoNormal"><br />
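To make the last two paragraphs concrete, here is a small sketch (plain Python, no libraries) comparing the two routes to the angle, and demonstrating the flat-slope problem near 0 degrees:

```python
import math

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def angle_via_cosine(a, b):
    a, b = unit(a), unit(b)
    dot = sum(x * y for x, y in zip(a, b))
    return math.acos(max(-1.0, min(1.0, dot)))  # clamp against round-off

def angle_via_chord(a, b):
    # Sine(theta/2) = |a - b|/2 for unit vectors, as in the post
    a, b = unit(a), unit(b)
    half_chord = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b))) / 2
    return 2 * math.asin(min(1.0, half_chord))

# Both routes give 45 degrees here:
print(math.degrees(angle_via_cosine([1, 0], [1, 1])))  # ~45
print(math.degrees(angle_via_chord([1, 0], [1, 1])))   # ~45

# The flat slope of Cosine near 0: a change of only ~0.02 in the Cosine
# spans the entire gap between 2 degrees apart and 12 degrees apart.
print(math.cos(math.radians(2)) - math.cos(math.radians(12)))  # ~0.021
```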
</div><div class="MsoNormal">What about that theorem in Manning and Sch<span>ü</span>tze<a href="http://www.blogger.com/post-create.g?blogID=6519535535434149951&pli=1#_ftn1" name="_ftnref1" title=""><span class="MsoFootnoteReference"><span><!--[if !supportFootnotes]--><span class="MsoFootnoteReference"><span style="font-family: Cambria; font-size: 12pt;">[1]</span></span><!--[endif]--></span></span></a>, which proves that Cosine leads to the same ordering as does the angle? Yes, it is mathematically true, and obvious, I might add, since Cosine is monotonic. However, it is not true computationally: an error of 0.025 in the Cosine spells the difference between 2 degrees and 12 degrees, which could completely invert the ordering!<o:p></o:p></div><div class="MsoNormal"><br />
</div><div class="MsoNormal">In conclusion: Cosine Similarity is no good as a cardinal quantifier, no good as an ordinal quantifier, and horrendous for calculating theta.<o:p></o:p></div><div class="MsoNormal"><br />
</div><div class="MsoNormal">As far as what I propose to do, you’ll just have to wait for the movie.<o:p></o:p></div><div class="MsoNormal"><br />
</div><div class="MsoNormal">Cosine Similarity Murdabad!<a href="http://www.blogger.com/post-create.g?blogID=6519535535434149951&pli=1#_ftn2" name="_ftnref2" title=""><span class="MsoFootnoteReference"><span><!--[if !supportFootnotes]--><span class="MsoFootnoteReference"><span style="font-family: Cambria; font-size: 12pt;">[2]</span></span><!--[endif]--></span></span></a> <o:p></o:p></div><div class="MsoNormal"><br />
</div><div><!--[if !supportFootnotes]--><br clear="all" /> <hr align="left" size="1" width="33%" /> <!--[endif]--> <div id="ftn1"> <div class="MsoFootnoteText"><a href="http://www.blogger.com/post-create.g?blogID=6519535535434149951&pli=1#_ftnref1" name="_ftn1" title=""><span class="MsoFootnoteReference"><span><!--[if !supportFootnotes]--><span class="MsoFootnoteReference"><span style="font-family: Cambria; font-size: 12pt;">[1]</span></span><!--[endif]--></span></span></a> An excellent introductory text on “Foundations of Statistical Natural Language Processing”.<o:p></o:p></div></div><div id="ftn2"> <div class="MsoFootnoteText"><a href="http://www.blogger.com/post-create.g?blogID=6519535535434149951&pli=1#_ftnref2" name="_ftn2" title=""><span class="MsoFootnoteReference"><span><!--[if !supportFootnotes]--><span class="MsoFootnoteReference"><span style="font-family: Cambria; font-size: 12pt;">[2]</span></span><!--[endif]--></span></span></a> Roughly translates as “Die! Die! Die!”<o:p></o:p></div></div></div><!--EndFragment--> ClimberT8http://www.blogger.com/profile/12302281220153714545noreply@blogger.com0tag:blogger.com,1999:blog-6519535535434149951.post-58946346712890936512011-08-13T13:16:00.000-07:002011-08-13T13:16:36.372-07:00Proof of Conjecture for paths in hypercube lattices<b>Problem:</b><br />
<ol><li>hypercubical lattice in D dimensions</li>
<li>start from O = (0,0...)</li>
<li>allowed moves are only of the form (0, 0, ..., +1, ..., 0, 0), i.e. a unit step in one coordinate direction</li>
<li>End point of path is (N1, N2,...ND)</li>
</ol>What is the number of paths?<br />
<br />
<b>Answer:</b><br />
The multinomial coefficient<br />
C(N1+N2+...+ND; N1, N2, ..., ND) = (N1+N2+...+ND)! / (N1! N2! ... ND!)<br />
<br />
<b>Proof:</b><br />
Let the whole number valued coordinate axes be n_i, i = 1...D.<br />
The end point lies on the hyperplane \sum(n_i) = \sum(N_i).<br />
One has to choose \sum(N_i) times to get to that hyperplane, from between D choices each time (the choice of direction to move in).<br />
To get to the point (N_1, N_2, ..., N_D), exactly N_i choices have to have been in the i-th direction.<br />
Hence the number of paths is the number of ways of choosing the above partition from its sum.<br />
<br />
In 3D, this is the number of ways of choosing 3 Easts, 5 Norths and 2 Ups from 10 choices to reach the point 3E, 5N and 2U.<br />
<br />
Hence the multinomial coefficient. QED<br />
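The result is easy to check numerically. A sketch in Python comparing the multinomial formula against a direct recursive count, using the 3E/5N/2U example above:

```python
from math import factorial

def multinomial(counts):
    # (N1 + N2 + ... + ND)! / (N1! N2! ... ND!)
    result = factorial(sum(counts))
    for n in counts:
        result //= factorial(n)
    return result

def count_paths(point):
    # direct recursion: sum the path counts over all "parent" lattice nodes
    if all(c == 0 for c in point):
        return 1
    return sum(count_paths(point[:i] + (c - 1,) + point[i + 1:])
               for i, c in enumerate(point) if c > 0)

print(multinomial((3, 5, 2)))   # 2520
print(count_paths((3, 5, 2)))   # 2520
```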
<br />
As a little extra: let's say we were only interested in paths between diagonally opposite points of the D-hypercube, say (0,0,...) and (N,N,...).<br />
<br />
Then the number of paths is (D*N)!/(N!)**D, which was my conjecture two days ago!!!<br />
<br />
<br />
Now, the analytical answer can be produced with two more lines of code: one to read the dimension from the command line, the other to read the coordinates of the end point.<br />
<br />
The codeCounting program would need a bit more, a couple of <b>for</b> loops to write out the D-dimensional python code (if needed) and then to call it. <br />
<br />
But the amount of time it is going to take is going to be ... HELP?<br />
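The "HELP?" is justified: the plain recursion recomputes every interior lattice point once per path through it, so the running time grows like the path count itself. Memoizing the recursion (not part of the original posts; `functools.lru_cache` is one stock way to do it) makes the cost proportional to the number of lattice points instead:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def num_paths(point):
    # point is a tuple (n1, ..., nD); each distinct point is evaluated once
    if all(c == 0 for c in point):
        return 1
    return sum(num_paths(point[:i] + (c - 1,) + point[i + 1:])
               for i, c in enumerate(point) if c > 0)

# (20, 20) has ~1.4e11 paths; the memoized count returns instantly,
# where the unmemoized recursion would need on the order of 1e11 calls.
print(num_paths((20, 20)))  # 137846528820
```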
<br />
Last point for today: The factorial function can be extended to the complex plane, except at the negative integers, by the <a href="http://en.wikipedia.org/wiki/Gamma_function">Gamma function</a>.<br />
<br />
So now I can legitimately answer the question: How many paths are there to a (say, integer) diagonal point in a <b>fractional</b> dimension?<br />
<br />
Answer = Gamma(N*D +1)/(N!)**D.<br />
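Under that Gamma-function extension, the formula is one line of Python (a sketch; whether "number of paths" means anything in a fractional dimension is exactly the speculation below):

```python
from math import gamma, factorial

def diagonal_paths(N, D):
    # Gamma(N*D + 1) / (N!)**D; for whole-number D this is (N*D)! / (N!)**D
    return gamma(N * D + 1) / factorial(N) ** D

print(diagonal_paths(2, 3))    # 90.0, the ordinary 3-D count 6!/(2!2!2!)
print(diagonal_paths(2, 2.5))  # ~21.2 "paths" in dimension 2.5
```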
<br />
Objection: Who cares about your perspective? What practical value can it possibly have?<br />
<br />
Here goes: Play fast and loose with "fractional" = "fractal". Now we actually have something we can visualize: paths on a square Sierpinski gasket, for example.<br />
<br />
And the practical use? Conjecture: The structure of the web, or of networks on the web, is <b>fractal</b>. There is fractal behaviour even in Statistical Natural Language Processing; Benoit Mandelbrot did important work on it in the 1950s.<br />
<br />
Let's say that the LinkedIn network is fractal. Maybe the above could be of some use in counting paths, for a better calculation of the clustering coefficient of an egonet.<br />
<br />
From the perspective of SNLP, simple extensions of the above counting processes lead to the distributions of sizes of stochastically generated words, e.g. with truncation on the choice of ' '. <br />
<br />
Ranjeet<br />
<br />
<br />
ClimberT8http://www.blogger.com/profile/12302281220153714545noreply@blogger.com0tag:blogger.com,1999:blog-6519535535434149951.post-36495477532830172342011-08-13T12:35:00.000-07:002011-08-13T12:35:37.231-07:00simplifying the boundary conditionsLast night and this morning when I woke up I was thinking of the extension to higher dimensions. One thought was that the number of elif conditions with special-case sums for the points on the boundary would grow as 2**D - 1 (all the "lower" edges, faces, vertices, etc.). Even if I wrote Python code to<br />
<ol><li>take in the dimension as an argument and </li>
<li>have it write python code for the dimension required (which I'll do in a minute)</li>
</ol>this would be quite tedious.<br />
Then I realised that the recursive self-calls stop when they encounter a value, so I can just set boundary conditions for "parent nodes" on the enveloping lower hyper-planes (of which there are only D) to be 0.<br />
<br />
Here is the somewhat more "biutiful" code (I rented it, but ended up watching Terminator 3 instead):<br />
<br />
#!/usr/bin/python <br />
""" 'codeCountPaths1.py Natural [Natural]' <br />
Single argument -> (n,n) <br />
code to count the number of paths <br />
Assumption 1: starting at (0,0) <br />
Assumption 2: ending at user input (n,m) <br />
moving in (+,+) direction only, on a complete Euclidean square lattice. <br />
CHANGE from codeCountPaths.py: <br />
A branch of the recursion will stop when it encounters a value, so <br />
BCs can be accounted for by putting in lines of <br />
0s at n= -1 and m = -1 as opposed to <br />
restricting the sum for boundary points to just those parent nodes <br />
which lie within the domain. There is no need for the 0-valued <br />
parent nodes to go back to the n+m = 0 line (or soon2B hyperplane) <br />
"""<br />
import sys, math<br />
def NumPaths(i,j,n,m):<br />
if i==0 and j==0:<br />
numpaths = 1 # BC at origin<br />
elif ((i==-1) and j in range(1,m+1)) or ((j==-1) and i in range(1,n+1)):<br />
numpaths = 0 # BCs at enveloping lower lines<br />
else:<br />
numpaths = NumPaths(i-1,j,n,m) + NumPaths(i,j-1,n,m)<br />
return numpaths<br />
def main():<br />
# reads command line arguments for the location of the ending point <br />
if len(sys.argv) < 2:<br />
print "usage: $ python codeCountPaths1.py Natural [Natural]"<br />
return<br />
elif len(sys.argv) == 2:<br />
n = int(sys.argv[1])<br />
m = int(sys.argv[1])<br />
elif len(sys.argv) ==3:<br />
n = int(sys.argv[1])<br />
m = int(sys.argv[2])<br />
<br />
# scolds user for trying to trick poor little machine <br />
if (n < 0) or (m < 0):<br />
print "BoundError: n >= 0 and m >= 0. Tricks later, for now non-negative integers only please"<br />
return<br />
# asks for answer, and sits back for an awfully long time <br />
ans = NumPaths(n,m,n,m)<br />
print "Code counted number of paths from (0,0) to (",n,', ',m,") on a complet\<br />
e Euclidean square lattice moving only in the (+,+) direction = ", ans<br />
if __name__ == '__main__':<br />
main()<br />
<br />
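For the record, the same zero-boundary trick generalizes to any dimension D with no code generation at all — pass the point as a tuple and decrement each coordinate in turn. A Python 3 sketch (the memoization via functools.lru_cache is my addition, to keep it tractable):

```python
from functools import lru_cache
from math import factorial

@lru_cache(maxsize=None)
def num_paths_d(point):
    """Monotone lattice paths from the origin to `point` in any dimension D.
    A coordinate of -1 lies on one of the D enveloping lower hyperplanes,
    where the value is simply 0, so the recursion stops there."""
    if all(c == 0 for c in point):
        return 1          # BC at the origin
    if any(c == -1 for c in point):
        return 0          # BCs on the D enveloping hyperplanes
    return sum(num_paths_d(point[:k] + (point[k] - 1,) + point[k + 1:])
               for k in range(len(point)))
```

num_paths_d((2, 2)) gives 6, and num_paths_d((N,) * D) agrees with the closed form (N*D)!/(N!)**D.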
Question: I have a recursion relation for the number of time-steps needed, but I can't solve it analytically, nor have I managed to write code that correctly counts each time a call is made.<br />
<br />
Can one of you CS gods help put better bounds on "awfully long time"?<br />
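One way to pin down "awfully long time" empirically is to instrument the recursion with a call counter. The call count T(n,m) obeys T(n,m) = 1 + T(n-1,m) + T(n,m-1) with T = 1 on the boundaries, so it grows at least as fast as the answer (n+m)C(n) itself — exponential along the diagonal. A sketch (same recursion as codeCountPaths1.py, ported to Python 3):

```python
calls = 0

def num_paths(i, j):
    """The recursion from codeCountPaths1.py, instrumented with a call counter."""
    global calls
    calls += 1
    if i == 0 and j == 0:
        return 1          # BC at origin
    if i == -1 or j == -1:
        return 0          # BCs at the enveloping lower lines
    return num_paths(i - 1, j) + num_paths(i, j - 1)

def count_calls(n, m):
    """Return (number of paths, number of recursive calls made)."""
    global calls
    calls = 0
    return num_paths(n, m), calls
```

For example, reaching (2,2) takes 27 calls for an answer of 6, and (3,3) takes 99 calls for an answer of 20.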
ClimberT8http://www.blogger.com/profile/12302281220153714545noreply@blogger.com0tag:blogger.com,1999:blog-6519535535434149951.post-8867469543179218272011-08-13T03:04:00.000-07:002011-08-13T03:16:08.875-07:00Code for A BEAUTIFUL BEAUTIFUL PROBLEMBeautiful code!!!<br />
<br />
OK: I can't believe<br />
0) yesterday's code was so Uuugly!!<br />
1) I completely missed all sorts of beautiful things and the joy of having coded this problem (successfully, meaning without crashing my computer or bringing down the whole LinkedIn site, even though I had been careful enough to close all connections before hitting 'return')<br />
2) Yesss! It was also correct!<br />
<br />
Yesterday's code:<br />
#!/usr/bin/python <br />
""" 'diagPaths.py' """<br />
import sys, math<br />
def diagN(n):<br />
if n < 1:<br />
print "tricks later, for now positive integer only please"<br />
else:<br />
return math.factorial(2*(n-1))/math.factorial(n-1)**2<br />
def main():<br />
if len(sys.argv) < 3:<br />
# <br />
print "usage: $ python diagPaths.py --n natural number"<br />
elif sys.argv[1] == '--n':<br />
n = int(sys.argv[2])<br />
ans = diagN(n)<br />
print "Number of paths between diagonal vertices for ",n, "-square lattice \<br />
= ", ans<br />
if __name__ == '__main__':<br />
main()<br />
<br />
3) If you want to exploit the beauty and power of recursion, code backwards! The number of paths to get to (i,j) is = numpaths(i-1,j) + numpaths(i,j-1). Then just ask for numpaths(whatever you want) and let the code call itself ...<br />
<br />
4) ... Modulo the bounds: while thinking about the bounds I realised that the upper bounds don't constrain anything. Which in turn means ...<br />
<br />
5) The number of paths to the point <i>(i,j)</i> from <i>(0,0)</i> is just <i>(i+j)C(j)</i>. Why? It is the <i>i+j</i> th diagonal, and the point is the <i>j</i>th from one end of it, so it is just the Binomial coefficient given above.<br />
<br />
6) Which further means that we have an alternative proof of the sum of the squares of the binomial coefficients: yesterday I explained that the numpaths to get to (n,n) is the sum of the squares of the numpaths to get to points on the transvecting diagonal, which is the sum of the squares of the Binomial coefficients. Today, I've explained that the numpaths to get to (i,j) is <i>(i+j)C(j)</i>. Hence to get to (n,n) the numpaths is <i>(2n)C(n)</i> = (2n)!/(n!)^2 !!!! (that's excitement, not factorials).<br />
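Both 5) and 6) are easy to sanity-check mechanically. A quick Python 3 sketch (math.comb needs Python 3.8+):

```python
from math import comb

def num_paths(i, j):
    """Backwards recursion from point 3): paths into (i, j) arrive from the
    left and from below; along either axis there is only one way."""
    if i == 0 or j == 0:
        return 1
    return num_paths(i - 1, j) + num_paths(i, j - 1)

n = 6
# 5) closed form: the number of paths to (i, j) is (i+j)C(j)
assert all(num_paths(i, j) == comb(i + j, j) for i in range(n) for j in range(n))
# 6) the alternative proof: sum of squares of binomial coefficients = (2n)C(n)
assert sum(comb(n, j) ** 2 for j in range(n + 1)) == comb(2 * n, n)
```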
<br />
Analytical answer:<br />
<br />
#!/usr/bin/python <br />
""" 'AnalyticalCountPaths.py' 1 argument -> end point (n,n) <br />
2 arguments -> (n,m) <br />
assume complete square Euclidean lattice <br />
starting point = (0,0) <br />
"""<br />
import sys, math<br />
def main():<br />
if len(sys.argv) < 2:<br />
print "usage: $ python AnalyticalCountPaths.py Natural [Natural]"<br />
return<br />
elif len(sys.argv) == 2:<br />
n = int(sys.argv[1])<br />
m = int(sys.argv[1])<br />
elif len(sys.argv) ==3:<br />
n = int(sys.argv[1])<br />
m = int(sys.argv[2])<br />
if (n < 0) or (m < 0):<br />
print "EEp! Bloink! HAL No Compute, go fall into pole of Riemann Zeta. BoundError"<br />
print "For now positive integers only please"<br />
else:<br />
ans = math.factorial(n+m)/(math.factorial(n)*math.factorial(m))<br />
print "Analytical solution: Number of paths from (0,0) to (",n,', ',m,") on a\<br />
complete Euclidean square lattice moving only in the (+,+) direction = ", ans<br />
print "\n Oh my but this problem is beautiful!! Alternate proof of sum of squ\<br />
ares of Binomial coefficients!"<br />
if __name__ == '__main__':<br />
main()<br />
<br />
so why do I need to write the code for actually counting this?<br />
Because, I can't believe I am saying this, it is EVEN more beautiful!<br />
<b>AND FINALLY</b>:<br />
#!/usr/bin/python <br />
""" 'codeCountPaths.py Natural [Natural]' <br />
Single argument -> (n,n) <br />
code to count the number of paths <br />
Assumption 1: starting at (0,0) <br />
Assumption 2: ending at user input (n,m) <br />
moving in (+,+) direction only, on a complete Euclidean square lattice. <br />
"""<br />
import sys, math<br />
def NumPaths(n,m):<br />
if (n < 0) or (m < 0):<br />
print "BoundError: n >= 0 and m >= 0. Tricks later, for now positive integer only please"<br />
return 0<br />
elif n==0 and m==0:<br />
numpaths = 1<br />
elif n == 0:<br />
numpaths = NumPaths(n, m-1)<br />
elif m == 0:<br />
numpaths = NumPaths(n-1,m)<br />
else: # THE MEAT OF THE CODE<br />
numpaths = NumPaths(n-1,m) + NumPaths(n,m-1)<br />
return numpaths<br />
def main():<br />
if len(sys.argv) < 2:<br />
print "usage: $ python codeCountPaths.py Natural [Natural]"<br />
return<br />
elif len(sys.argv) == 2:<br />
n = int(sys.argv[1])<br />
m = int(sys.argv[1])<br />
elif len(sys.argv) ==3:<br />
n = int(sys.argv[1])<br />
m = int(sys.argv[2])<br />
ans = NumPaths(n,m) # <b>here is where I ask for the answer just once!</b><br />
print "Code counted number of paths from (0,0) to (",n,', ',m,") on a complet\<br />
e Euclidean square lattice moving only in the (+,+) direction = ", ans<br />
print "\n Oh my but this problem is beautiful!!"<br />
if __name__ == '__main__':<br />
main()<br />
<br />
I'll confess I was nervous when I ran this: I'd already calculated the answer in my head, and was praying to Pascal that my code counting would be right!<br />
<br />
Did I thank _ enough for this problem?<br />
Tomorrow: we'll relax the Euclidean requirement, then the "completeness".<br />
<br />
By The Way: I should remind myself that any second year Numerical Path Integral Physics Grad student is probably doing this in their sleep.<br />
<br />
ClimberT8http://www.blogger.com/profile/12302281220153714545noreply@blogger.com0tag:blogger.com,1999:blog-6519535535434149951.post-60917044229211761592011-08-11T21:25:00.000-07:002011-08-11T21:25:01.906-07:00A BEAUTIFUL BEAUTIFUL PROBLEM<div dir="ltr" style="text-align: left;" trbidi="on"><br />
<!--[if gte mso 9]><xml> <w:LatentStyles DefLockedState="false" DefUnhideWhenUsed="true"
UnhideWhenUsed="false" Name="Medium Grid 1 Accent 1"/> <w:LsdException Locked="false" Priority="68" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 2 Accent 1"/> <w:LsdException Locked="false" Priority="69" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 3 Accent 1"/> <w:LsdException Locked="false" Priority="70" SemiHidden="false"
UnhideWhenUsed="false" Name="Dark List Accent 1"/> <w:LsdException Locked="false" Priority="71" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Shading Accent 1"/> <w:LsdException Locked="false" Priority="72" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful List Accent 1"/> <w:LsdException Locked="false" Priority="73" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Grid Accent 1"/> <w:LsdException Locked="false" Priority="60" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Shading Accent 2"/> <w:LsdException Locked="false" Priority="61" SemiHidden="false"
UnhideWhenUsed="false" Name="Light List Accent 2"/> <w:LsdException Locked="false" Priority="62" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Grid Accent 2"/> <w:LsdException Locked="false" Priority="63" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 1 Accent 2"/> <w:LsdException Locked="false" Priority="64" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 2 Accent 2"/> <w:LsdException Locked="false" Priority="65" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 1 Accent 2"/> <w:LsdException Locked="false" Priority="66" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 2 Accent 2"/> <w:LsdException Locked="false" Priority="67" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 1 Accent 2"/> <w:LsdException Locked="false" Priority="68" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 2 Accent 2"/> <w:LsdException Locked="false" Priority="69" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 3 Accent 2"/> <w:LsdException Locked="false" Priority="70" SemiHidden="false"
UnhideWhenUsed="false" Name="Dark List Accent 2"/> <w:LsdException Locked="false" Priority="71" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Shading Accent 2"/> <w:LsdException Locked="false" Priority="72" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful List Accent 2"/> <w:LsdException Locked="false" Priority="73" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Grid Accent 2"/> <w:LsdException Locked="false" Priority="60" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Shading Accent 3"/> <w:LsdException Locked="false" Priority="61" SemiHidden="false"
UnhideWhenUsed="false" Name="Light List Accent 3"/> <w:LsdException Locked="false" Priority="62" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Grid Accent 3"/> <w:LsdException Locked="false" Priority="63" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 1 Accent 3"/> <w:LsdException Locked="false" Priority="64" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 2 Accent 3"/> <w:LsdException Locked="false" Priority="65" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 1 Accent 3"/> <w:LsdException Locked="false" Priority="66" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 2 Accent 3"/> <w:LsdException Locked="false" Priority="67" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 1 Accent 3"/> <w:LsdException Locked="false" Priority="68" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 2 Accent 3"/> <w:LsdException Locked="false" Priority="69" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 3 Accent 3"/> <w:LsdException Locked="false" Priority="70" SemiHidden="false"
UnhideWhenUsed="false" Name="Dark List Accent 3"/> <w:LsdException Locked="false" Priority="71" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Shading Accent 3"/> <w:LsdException Locked="false" Priority="72" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful List Accent 3"/> <w:LsdException Locked="false" Priority="73" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Grid Accent 3"/> <w:LsdException Locked="false" Priority="60" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Shading Accent 4"/> <w:LsdException Locked="false" Priority="61" SemiHidden="false"
UnhideWhenUsed="false" Name="Light List Accent 4"/> <w:LsdException Locked="false" Priority="62" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Grid Accent 4"/> <w:LsdException Locked="false" Priority="63" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 1 Accent 4"/> <w:LsdException Locked="false" Priority="64" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 2 Accent 4"/> <w:LsdException Locked="false" Priority="65" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 1 Accent 4"/> <w:LsdException Locked="false" Priority="66" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 2 Accent 4"/> <w:LsdException Locked="false" Priority="67" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 1 Accent 4"/> <w:LsdException Locked="false" Priority="68" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 2 Accent 4"/> <w:LsdException Locked="false" Priority="69" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 3 Accent 4"/> <w:LsdException Locked="false" Priority="70" SemiHidden="false"
UnhideWhenUsed="false" Name="Dark List Accent 4"/> <w:LsdException Locked="false" Priority="71" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Shading Accent 4"/> <w:LsdException Locked="false" Priority="72" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful List Accent 4"/> <w:LsdException Locked="false" Priority="73" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Grid Accent 4"/> <w:LsdException Locked="false" Priority="60" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Shading Accent 5"/> <w:LsdException Locked="false" Priority="61" SemiHidden="false"
UnhideWhenUsed="false" Name="Light List Accent 5"/> <w:LsdException Locked="false" Priority="62" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Grid Accent 5"/> <w:LsdException Locked="false" Priority="63" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 1 Accent 5"/> <w:LsdException Locked="false" Priority="64" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 2 Accent 5"/> <w:LsdException Locked="false" Priority="65" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 1 Accent 5"/> <w:LsdException Locked="false" Priority="66" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 2 Accent 5"/> <w:LsdException Locked="false" Priority="67" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 1 Accent 5"/> <w:LsdException Locked="false" Priority="68" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 2 Accent 5"/> <w:LsdException Locked="false" Priority="69" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 3 Accent 5"/> <w:LsdException Locked="false" Priority="70" SemiHidden="false"
UnhideWhenUsed="false" Name="Dark List Accent 5"/> <w:LsdException Locked="false" Priority="71" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Shading Accent 5"/> <w:LsdException Locked="false" Priority="72" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful List Accent 5"/> <w:LsdException Locked="false" Priority="73" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Grid Accent 5"/> <w:LsdException Locked="false" Priority="60" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Shading Accent 6"/> <w:LsdException Locked="false" Priority="61" SemiHidden="false"
UnhideWhenUsed="false" Name="Light List Accent 6"/> <w:LsdException Locked="false" Priority="62" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Grid Accent 6"/> <w:LsdException Locked="false" Priority="63" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 1 Accent 6"/> <w:LsdException Locked="false" Priority="64" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 2 Accent 6"/> <w:LsdException Locked="false" Priority="65" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 1 Accent 6"/> <w:LsdException Locked="false" Priority="66" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 2 Accent 6"/> <w:LsdException Locked="false" Priority="67" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 1 Accent 6"/> <w:LsdException Locked="false" Priority="68" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 2 Accent 6"/> <w:LsdException Locked="false" Priority="69" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 3 Accent 6"/> <w:LsdException Locked="false" Priority="70" SemiHidden="false"
UnhideWhenUsed="false" Name="Dark List Accent 6"/> <w:LsdException Locked="false" Priority="71" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Shading Accent 6"/> <w:LsdException Locked="false" Priority="72" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful List Accent 6"/> <w:LsdException Locked="false" Priority="73" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Grid Accent 6"/> <w:LsdException Locked="false" Priority="19" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Subtle Emphasis"/> <w:LsdException Locked="false" Priority="21" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Intense Emphasis"/> <w:LsdException Locked="false" Priority="31" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Subtle Reference"/> <w:LsdException Locked="false" Priority="32" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Intense Reference"/> <w:LsdException Locked="false" Priority="33" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Book Title"/> <w:LsdException Locked="false" Priority="37" Name="Bibliography"/> <w:LsdException Locked="false" Priority="39" QFormat="true" Name="TOC Heading"/> </w:LatentStyles> </xml><![endif]--> <style>
<!--
/* Font Definitions */
@font-face
{font-family:"MS 明朝";
panose-1:0 0 0 0 0 0 0 0 0 0;
mso-font-charset:128;
mso-generic-font-family:roman;
mso-font-format:other;
mso-font-pitch:fixed;
mso-font-signature:1 134676480 16 0 131072 0;}
@font-face
{font-family:"MS ゴシック";
panose-1:0 0 0 0 0 0 0 0 0 0;
mso-font-charset:128;
mso-generic-font-family:modern;
mso-font-format:other;
mso-font-pitch:fixed;
mso-font-signature:1 134676480 16 0 131072 0;}
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;
mso-font-charset:0;
mso-generic-font-family:auto;
mso-font-pitch:variable;
mso-font-signature:-536870145 1107305727 0 0 415 0;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;
mso-font-charset:0;
mso-generic-font-family:auto;
mso-font-pitch:variable;
mso-font-signature:-520092929 1073786111 9 0 415 0;}
@font-face
{font-family:Cambria;
panose-1:2 4 5 3 5 4 6 3 2 4;
mso-font-charset:0;
mso-generic-font-family:auto;
mso-font-pitch:variable;
mso-font-signature:-536870145 1073743103 0 0 415 0;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{mso-style-unhide:no;
mso-style-qformat:yes;
mso-style-parent:"";
margin:0in;
margin-bottom:.0001pt;
mso-pagination:widow-orphan;
font-size:12.0pt;
font-family:Cambria;
mso-ascii-font-family:Cambria;
mso-ascii-theme-font:minor-latin;
mso-fareast-font-family:"MS 明朝";
mso-fareast-theme-font:minor-fareast;
mso-hansi-font-family:Cambria;
mso-hansi-theme-font:minor-latin;
mso-bidi-font-family:"Times New Roman";
mso-bidi-theme-font:minor-bidi;}
h1
{mso-style-priority:9;
mso-style-unhide:no;
mso-style-qformat:yes;
mso-style-link:"Heading 1 Char";
mso-style-next:Normal;
margin-top:24.0pt;
margin-right:0in;
margin-bottom:0in;
margin-left:0in;
margin-bottom:.0001pt;
mso-pagination:widow-orphan lines-together;
page-break-after:avoid;
mso-outline-level:1;
font-size:16.0pt;
font-family:Calibri;
mso-ascii-font-family:Calibri;
mso-ascii-theme-font:major-latin;
mso-fareast-font-family:"MS ゴシック";
mso-fareast-theme-font:major-fareast;
mso-hansi-font-family:Calibri;
mso-hansi-theme-font:major-latin;
mso-bidi-font-family:"Times New Roman";
mso-bidi-theme-font:major-bidi;
color:#345A8A;
mso-themecolor:accent1;
mso-themeshade:181;
mso-font-kerning:0pt;}
h2
{mso-style-priority:9;
mso-style-qformat:yes;
mso-style-link:"Heading 2 Char";
mso-style-next:Normal;
margin-top:10.0pt;
margin-right:0in;
margin-bottom:0in;
margin-left:0in;
margin-bottom:.0001pt;
mso-pagination:widow-orphan lines-together;
page-break-after:avoid;
mso-outline-level:2;
font-size:13.0pt;
font-family:Calibri;
mso-ascii-font-family:Calibri;
mso-ascii-theme-font:major-latin;
mso-fareast-font-family:"MS ゴシック";
mso-fareast-theme-font:major-fareast;
mso-hansi-font-family:Calibri;
mso-hansi-theme-font:major-latin;
mso-bidi-font-family:"Times New Roman";
mso-bidi-theme-font:major-bidi;
color:#4F81BD;
mso-themecolor:accent1;}
span.Heading1Char
{mso-style-name:"Heading 1 Char";
mso-style-priority:9;
mso-style-unhide:no;
mso-style-locked:yes;
mso-style-link:"Heading 1";
mso-ansi-font-size:16.0pt;
mso-bidi-font-size:16.0pt;
font-family:Calibri;
mso-ascii-font-family:Calibri;
mso-ascii-theme-font:major-latin;
mso-fareast-font-family:"MS ゴシック";
mso-fareast-theme-font:major-fareast;
mso-hansi-font-family:Calibri;
mso-hansi-theme-font:major-latin;
mso-bidi-font-family:"Times New Roman";
mso-bidi-theme-font:major-bidi;
color:#345A8A;
mso-themecolor:accent1;
mso-themeshade:181;
font-weight:bold;}
span.Heading2Char
{mso-style-name:"Heading 2 Char";
mso-style-priority:9;
mso-style-unhide:no;
mso-style-locked:yes;
mso-style-link:"Heading 2";
mso-ansi-font-size:13.0pt;
mso-bidi-font-size:13.0pt;
font-family:Calibri;
mso-ascii-font-family:Calibri;
mso-ascii-theme-font:major-latin;
mso-fareast-font-family:"MS ゴシック";
mso-fareast-theme-font:major-fareast;
mso-hansi-font-family:Calibri;
mso-hansi-theme-font:major-latin;
mso-bidi-font-family:"Times New Roman";
mso-bidi-theme-font:major-bidi;
color:#4F81BD;
mso-themecolor:accent1;
font-weight:bold;}
.MsoChpDefault
{mso-style-type:export-only;
mso-default-props:yes;
font-family:Cambria;
mso-ascii-font-family:Cambria;
mso-ascii-theme-font:minor-latin;
mso-fareast-font-family:"MS 明朝";
mso-fareast-theme-font:minor-fareast;
mso-hansi-font-family:Cambria;
mso-hansi-theme-font:minor-latin;
mso-bidi-font-family:"Times New Roman";
mso-bidi-theme-font:minor-bidi;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.25in 1.0in 1.25in;
mso-header-margin:.5in;
mso-footer-margin:.5in;
mso-paper-source:0;}
div.WordSection1
{page:WordSection1;}
-->
</style> <!--[if gte mso 10]> <style>
/* Style Definitions */
table.MsoNormalTable
{mso-style-name:"Table Normal";
mso-tstyle-rowband-size:0;
mso-tstyle-colband-size:0;
mso-style-noshow:yes;
mso-style-priority:99;
mso-style-parent:"";
mso-padding-alt:0in 5.4pt 0in 5.4pt;
mso-para-margin:0in;
mso-para-margin-bottom:.0001pt;
mso-pagination:widow-orphan;
font-size:12.0pt;
font-family:Cambria;
mso-ascii-font-family:Cambria;
mso-ascii-theme-font:minor-latin;
mso-hansi-font-family:Cambria;
mso-hansi-theme-font:minor-latin;}
</style> <![endif]--> <!--StartFragment--><o:p></o:p> <br />
<h2>BEAUTIFUL CODE WITHOUT CODE</h2><div class="MsoNormal"><br />
</div><div class="MsoNormal">The problem, as I recall it being described to me, is to calculate the number of ways to get from one vertex of a square lattice (of size <i>N</i> points per side) to the diagonally opposite one, where each move is a unit step in the positive direction along one of the two axes. It is, or used to be, a popular interview question for testing a candidate’s coding skills.</div><div class="MsoNormal"><br />
</div><div class="MsoNormal">My zeroth thought … I’ll come back to that later.</div><div class="MsoNormal"><br />
</div><div class="MsoNormal">My 1.0th thought was, of course, an <b>iterative</b> approach – figure out a process to get from the <i>k-1</i>th diagonal to the <i>k</i>th: at each step of the iteration, use the distribution of ways of reaching the <i>p</i>th point on the <i>k-1</i>th diagonal to calculate the number of ways of reaching the <i>q</i>th point on the <i>k</i>th, and store the new distribution. Start from <i>k=2</i> and repeat until <i>k=N</i>. OK, that gets us to the diagonal. Thought 1.1 was <b>divide and conquer</b>: get to the transvecting diagonal between the vertices, square the distribution, and then add.</div><div class="MsoNormal"><br />
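</div><div class="MsoNormal">The iterative scheme just described can be sketched in a few lines of Python (a minimal sketch in my own naming, not the post’s actual code):</div>

```python
# Iterative version: push the distribution of path counts from one
# diagonal to the next, Pascal's-triangle style, then square and sum.

def next_diagonal(prev):
    """Path counts on diagonal k, given those on diagonal k-1."""
    padded = [0] + prev + [0]           # zeros handle the two boundary points
    return [padded[i] + padded[i + 1] for i in range(len(prev) + 1)]

def count_paths(n):
    """Monotone corner-to-corner paths on an n-points-per-side lattice."""
    diag = [1]                          # one way to stand at the start corner
    for _ in range(2, n + 1):           # walk out to the main diagonal
        diag = next_diagonal(diag)
    return sum(x * x for x in diag)     # divide and conquer: square and add
```

<div class="MsoNormal">For n = 1, …, 5 this reproduces 1, 2, 6, 20, 70.</div><div class="MsoNormal"><br />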
</div><div class="MsoNormal">My second thought was to take a <b>recursive</b> approach for the first part (the distribution on the diagonal):</div><div class="MsoNormal">def diag(k):</div><div class="MsoNormal"><span>&nbsp;&nbsp;&nbsp; </span>if k == 1:</div><div class="MsoNormal"><span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span>return [1]</div><div class="MsoNormal"><span>&nbsp;&nbsp;&nbsp; </span>old = [0] + diag(k - 1) + [0]</div><div class="MsoNormal"><span>&nbsp;&nbsp;&nbsp; </span>return [old[i] + old[i + 1] for i in range(k)]</div><div class="MsoNormal">I had forgotten how much more beautiful recursion is than iteration!</div><div class="MsoNormal">All you do is call diag(N), square the elements, and sum!</div><div class="MsoNormal"><br />
</div><div class="MsoNormal">I think the above goes as n**3 (roughly n**2/2 additions, on counts that grow to O(n) digits).</div><div class="MsoNormal"><br />
</div><div class="MsoNormal">Time to implement and try it:</div><div class="MsoNormal">$ python diagonalPaths.py --n <i>natural number</i></div><div class="MsoNormal"><br />
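</div><div class="MsoNormal">I don’t have the original diagonalPaths.py, so here is a hedged reconstruction (the filename and the --n flag come from the post; everything else is my guess):</div>

```python
# Hypothetical reconstruction of diagonalPaths.py -- not the original.
import argparse

def diag(k):
    """Path counts along the k-th diagonal (a row of Pascal's triangle)."""
    if k == 1:
        return [1]
    old = [0] + diag(k - 1) + [0]       # pad so the boundary sums work
    return [old[i] + old[i + 1] for i in range(k)]

def main():
    parser = argparse.ArgumentParser()
    # The post shows --n as required; defaulting to 5 here for convenience.
    parser.add_argument("--n", type=int, default=5)
    args, _ = parser.parse_known_args()
    print(sum(x * x for x in diag(args.n)))  # square the diagonal and sum

if __name__ == "__main__":
    main()
```

<div class="MsoNormal"><br />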
</div><div class="MsoNormal">OK, so for n = 1, 2, 3, 4, 5 we get 1, 2, 6, 20, 70.</div><div class="MsoNormal"><br />
</div><div class="MsoNormal">So how do we know it is right? Let’s say you’ve debugged the code and validated it for n = 1, 2, 3, and it agrees with what the interviewer has. The answer agrees with the consensus; still, how do we know that it is right?</div><div class="MsoNormal"><br />
</div><div class="MsoNormal">Put another way: can you predict the answer for n = 6, or any n, before you run your code – or at least in less than n**3 time?</div><div class="MsoNormal">I’ll claim that my code runs in n**1 time. How?</div><div class="MsoNormal"><br />
</div><div class="MsoNormal">Well, I didn’t code the recursion. Which brings me to my 0<sup>th</sup> thought: I can do it analytically, so why code it?</div><div class="MsoNormal"><br />
</div><div class="MsoNormal">One can readily see that the distribution on the diagonal is the binomial, since at every node you are choosing which of the two axes to step along, and the number of steps taken along one axis gives your position on the diagonal. If we just wanted to get to the diagonal, the number of ways is 2**(n-1). So the answer we want is the sum of the squares of the binomial coefficients, which luckily has a closed form: (2N)!/(N!)**2, with N = n-1 moves per side. If ! can be computed in N time, so can my code.</div><div class="MsoNormal"><br />
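</div><div class="MsoNormal">The closed form is easy to sanity-check numerically against the sum-of-squares identity (my check, using only the standard library):</div>

```python
# Check the identity sum_k C(N,k)^2 == C(2N,N) == (2N)!/(N!)**2,
# the closed form quoted above, with N = n - 1 moves per side.
from math import comb, factorial

def closed_form(n):
    N = n - 1  # moves per side on an n-points-per-side lattice
    return factorial(2 * N) // factorial(N) ** 2

for n in range(2, 8):
    N = n - 1
    assert sum(comb(N, k) ** 2 for k in range(N + 1)) == closed_form(n)
```

<div class="MsoNormal"><br />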
</div><div class="MsoNormal">Comment 0: I can also ask the question “calculate the number of ways to get from one vertex of a square lattice (of size <i>N</i> points per side)”, where <i>N</i> is:</div><div class="MsoNormal">Non-positive</div><div class="MsoNormal">Rational</div><div class="MsoNormal">Real (for the computer this should be pseudo-real, or float, I think)</div><div class="MsoNormal">Complex</div><div class="MsoNormal">A matrix (real- or complex-valued)</div><div class="MsoNormal">Anything that forms a group under multiplication – or under any group operation satisfying the properties of multiplication, with the commutativity requirement relaxed.</div><div class="MsoNormal"><br />
</div><div class="MsoNormal">Since n! = Gamma(n+1), any of the above can be calculated (convergence for matrices will be harder to prove). </div><div class="MsoNormal"><br />
</div><div class="MsoNormal">I wouldn’t have the slightest idea how to code any of the above!</div><div class="MsoNormal"><br />
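</div><div class="MsoNormal">That said, the real-valued case at least is codable via the gamma function – a sketch under the assumption that we only need real N (math.gamma accepts neither complex nor matrix arguments):</div>

```python
# Evaluate the path-count formula (2N)!/(N!)**2 as
# Gamma(2N+1)/Gamma(N+1)**2, which makes sense for real N >= 0.
from math import gamma

def paths_real(N):
    return gamma(2 * N + 1) / gamma(N + 1) ** 2

# At integer N this reproduces the central binomial coefficients
# (N = 4 gives 70.0); at N = 2.5 it interpolates between the
# integer values 6 (N = 2) and 20 (N = 3).
```

<div class="MsoNormal"><br />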
</div><div class="MsoNormal">Comment 1: What about lattices of different dimensions? Now here you’ve got me; I can’t avoid coding. If we are only interested in the number of ways of getting to the nearest transvecting diagonal plane, the answer is simply D**(n-1). But it is no longer sufficient to square the multinomial coefficients and add, since we have to get through the other transvecting planes before getting from the last one to the opposite vertex. There are two such planes in 3D, with normal vector (1,1,1), at distances 1/sqrt(3) and 2/sqrt(3) from the origin; similarly in higher dimensions.</div><div class="MsoNormal"><br />
</div><div class="MsoNormal">Now if you construct the following D-dimensional solids – take two D-simplices and join them on their (D-1)-dimensional faces – then we are reduced to summing the squares of the multinomial coefficients. But otherwise:</div><div class="MsoNormal"><b>the “coding” value of this problem lies in its extension to higher dimensions.</b></div><div class="MsoNormal"><br />
</div><div class="MsoNormal">Comment 1.1: I am going to take a reasonable guess and claim that the answer in D dimensions is (Dn)!/(n!)**D. Again, by using the gamma function (and extending it to complex arguments), we can allow <i>both</i> n and D to take on complex values. I have no idea what a complex dimension looks like (though I suspect that John Conway and others do), but a fractional dimension, or at least a real-valued fractal dimension, I can visualize. The famous <a href="http://en.wikipedia.org/wiki/Sierpinski_triangle">Sierpinski gasket</a>, also <a href="http://www.jimloy.com/fractals/sierpins.htm">http://www.jimloy.com/fractals/sierpins.htm</a>, has a fractal dimension (box counting = Hausdorff) of ln(3)/ln(2) ~ 1.585. Similarly, if I construct the square equivalent (taking away 4 of the nine subsquares of a tictactoization of the square at every recursion), I get a lattice of fractal dimension ln(5)/ln(3) ~ 1.465 (5 copies, 1/3 linear dimension each). I can readily imagine that the constraint of having to pass through certain lattice points reduces the number of paths relative to the number in a square lattice.</div><div class="MsoNormal"><br />
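</div><div class="MsoNormal">The guess (Dn)!/(n!)**D can be brute-force checked in three dimensions (my check, not from the post):</div>

```python
# Count monotone paths from (0,0,0) to (n,n,n) by dynamic programming
# and compare against the conjectured closed form (Dn)!/(n!)**D.
from math import factorial
from itertools import product

def count_3d(n):
    """Monotone paths from (0,0,0) to (n,n,n), counted by DP."""
    ways = {(0, 0, 0): 1}
    # Process points in order of coordinate sum: each move raises it by 1.
    for pt in sorted(product(range(n + 1), repeat=3), key=sum):
        if pt == (0, 0, 0):
            continue
        x, y, z = pt
        ways[pt] = (ways.get((x - 1, y, z), 0)
                    + ways.get((x, y - 1, z), 0)
                    + ways.get((x, y, z - 1), 0))
    return ways[(n, n, n)]

def guess(n, D=3):
    """The conjectured closed form (Dn)!/(n!)**D."""
    return factorial(D * n) // factorial(n) ** D
```

<div class="MsoNormal"><br />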
</div><div class="MsoNormal">Comment 2: What a beautiful introduction to path integrals! The function of paths we are “integrating” above is just 1. But you can readily imagine summing the length L of the path over all paths, or say the phase change over the path, exp(i L), or indeed any other function of paths. If you use L, you are doing geometric optics; if you use some particle action S, why, “welcome to quantum mechanics!”</div><div class="MsoNormal"><br />
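</div><div class="MsoNormal">A toy version of the path-integral idea (my construction, not from the post): enumerate every monotone path on the n-per-side lattice as a sequence of R/U moves and sum an arbitrary functional F over them.</div>

```python
# Sum a functional F over all monotone paths; F = 1 recovers the count.
from itertools import permutations

def all_paths(n):
    """Distinct orderings of (n-1) R-moves and (n-1) U-moves."""
    return set(permutations("R" * (n - 1) + "U" * (n - 1)))

def path_sum(n, F):
    return sum(F(p) for p in all_paths(n))

def corners(p):
    """Number of direction changes along the path."""
    return sum(a != b for a, b in zip(p, p[1:]))
```

<div class="MsoNormal">With F = 1 you recover the plain path count; with F = corners (or exp(i L), or exp(i S)) you are summing a genuine functional of the path.</div><div class="MsoNormal"><br />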
</div><div class="MsoNormal">I’d like to thank _ for suggesting this problem.</div><div class="MsoNormal"><br />
</div><!--EndFragment--> </div>ClimberT8http://www.blogger.com/profile/12302281220153714545noreply@blogger.com0