Thursday, March 10, 2016

Push your configuration to the limit - spice it up with statistics

So - you built a service with an advanced algorithm that uses some kind of classification to detect certain events. How would you know your configuration is optimal? How would you measure the impact on accuracy each time the algorithm changes?

The following post will show you how to use a mixture of concepts from Machine Learning, Statistics & Operations Research in order to find the optimal configuration set for an algorithm.
The method is especially useful for a client-side classification mechanism that requires accuracy but lacks the learning time or processing power to train on the device.

The suggested method is divided into 5 phases:

  • Test data preparation
  • Running the test
  • Result measurement
  • Analysis
  • Determine the winner


Test data preparation

Prepare a CSV file where each record carries an indication of its correctness. For example, if the algorithm should detect cars, we would have data such as:
Toyota,1
mouse,0
Nissan,1
keyboard,0
Ford,1
button,0

Where 0 indicates that the record is not a car and 1 indicates that it is a car.
The size of the data set will be discussed later on and might deserve a separate post; for simplicity, assume we have 10 tests with 5 positive records and 5 negative records.


Running the test

Build a test that sends data to your algorithm.
The overall concept is simple:

  1. Read the test data
  2. Send it to your algorithm
  3. Record the result (i.e. whether the algorithm identified the data correctly)

However, since we want to find the optimal configuration, we shall add an additional step before sending the data to the algorithm:
set the algorithm with a specific configuration, meaning that the parameters your algorithm uses must be configurable for each run.
So the high-level run steps are:

  1. Read the test data
  2. Prepare algorithm configuration permutations
  3. Send the data to your algorithm for each configuration permutation
  4. Record the result for each run
A few things to note (a minimal harness sketch follows this list):

  • Your algorithm should always return a result, whether or not it identified the data, so you can record true negatives and false negatives.
  • Externalize your configuration and allow it to be set from an outer class, since it has to be replaced for each configuration set.
  • Shuffle the test data between runs & execute multiple runs per configuration set. Shuffling reduces any correlation with the data order, while multiple runs with the same configuration reduce random variance.
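To make the flow concrete, below is a minimal Java sketch of such a harness. The ClassifierAlgorithm interface, Configuration class, DummyClassifier stand-in and the test-data.csv file name are all hypothetical placeholders - adapt them to your own algorithm and configuration code.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ConfigurationTestHarness {

    // Hypothetical hooks - replace these with your own algorithm and configuration types.
    interface ClassifierAlgorithm {
        void configure(Configuration config);
        boolean classify(String input);   // true = classified as positive
    }

    static class Configuration {
        final int threshold;
        Configuration(int threshold) { this.threshold = threshold; }
        @Override public String toString() { return "threshold=" + threshold; }
    }

    static class LabeledSample {
        final String data;
        final boolean expectedPositive;
        LabeledSample(String data, boolean expectedPositive) {
            this.data = data;
            this.expectedPositive = expectedPositive;
        }
    }

    // Stand-in implementation so the sketch compiles and runs; replace with your real algorithm.
    static class DummyClassifier implements ClassifierAlgorithm {
        private Configuration config;
        public void configure(Configuration config) { this.config = config; }
        public boolean classify(String input) {
            // Placeholder rule - a real algorithm would use the configuration meaningfully.
            return input.length() * 10 >= config.threshold;
        }
    }

    public static void main(String[] args) throws IOException {
        // 1. Read the test data (a CSV of "value,label" rows, as shown above).
        List<LabeledSample> samples = new ArrayList<>();
        for (String line : Files.readAllLines(Paths.get("test-data.csv"))) {
            String[] parts = line.split(",");
            samples.add(new LabeledSample(parts[0].trim(), "1".equals(parts[1].trim())));
        }

        // 2. Prepare the configuration permutations (here: a single parameter swept over a range).
        List<Configuration> permutations = new ArrayList<>();
        for (int threshold = 40; threshold <= 70; threshold += 10) {
            permutations.add(new Configuration(threshold));
        }

        ClassifierAlgorithm algorithm = new DummyClassifier();
        int runsPerConfiguration = 5;   // multiple runs per configuration reduce random variance

        // 3 + 4. Run every permutation several times on shuffled data and record the outcome.
        for (Configuration config : permutations) {
            for (int run = 0; run < runsPerConfiguration; run++) {
                Collections.shuffle(samples);   // reduce correlation with the data order
                algorithm.configure(config);
                int correct = 0;
                for (LabeledSample sample : samples) {
                    if (algorithm.classify(sample.data) == sample.expectedPositive) {
                        correct++;
                    }
                }
                System.out.println(config + " run " + run + ": " + correct + "/" + samples.size());
            }
        }
    }
}

In the next section the raw "correct/incorrect" recording is broken down into the four confusion-matrix counters.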


Result measurement

When your test gets the result back from the algorithm, compare the predicted condition (positive/negative) against the true condition, based on the indication in the test data.
Make sure to count the following:

  • True positive (TP) = correctly identified, a hit!
    Classified correctly as positive
  • True negative (TN) = correctly rejected, equivalent to correct rejection
    Classified correctly as negative
  • False positive (FP) = incorrectly identified, equivalent to false alarm (type I error)
    Classified wrongly as positive
  • False negative (FN) = incorrectly rejected, a miss (type II error)
    Classified wrongly as negative


The algorithm accuracy is measured as follows:
Accuracy (ACC) = (TP + TN)/(TP + FP + FN + TN)


I also like to measure the following (a small computation sketch follows this list):

  • True Positive Rate (TPR), also known as sensitivity:
    TPR = TP/(TP + FN). If all your positive data were identified correctly, the value would be 1, so the closer the result is to 1, the better.
  • True Negative Rate (TNR), also known as specificity:
    TNR = TN/(TN + FP). If all the negative data were identified correctly, the value would be 1, so the closer the result is to 1, the better.
  • False Positive Rate (FPR), also known as fall-out:
    FPR = FP/(FP + TN). The closer the value is to 0, the fewer false alarms.
  • False Negative Rate (FNR), also known as miss rate:
    FNR = FN/(TP + FN). The closer the value is to 0, the fewer misses.
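Here is a small Java sketch that computes these measures from the four counts; the counts used here are made-up example numbers, to be replaced with the values your test run recorded:

public class ClassificationMetrics {

    public static void main(String[] args) {
        // Example counts - replace these with the values recorded by your own test run.
        int tp = 4, tn = 5, fp = 0, fn = 1;

        double accuracy = (double) (tp + tn) / (tp + fp + fn + tn);
        double tpr = (double) tp / (tp + fn);   // sensitivity / recall
        double tnr = (double) tn / (tn + fp);   // specificity
        double fpr = (double) fp / (fp + tn);   // fall-out
        double fnr = (double) fn / (tp + fn);   // miss rate

        System.out.printf("ACC=%.2f TPR=%.2f TNR=%.2f FPR=%.2f FNR=%.2f%n",
                accuracy, tpr, tnr, fpr, fnr);
    }
}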

Cost function:
Highly important.
This is where you quantify the statistical analysis and the results into a single score that helps determine how good the overall run/configuration is and also makes it simpler to compare runs.
The cost function formula looks as follows:

Score = (CTP × TP) + (CTN × TN) + (CFP × FP) + (CFN × FN)

Where CTP is the cost of a True Positive result, CTN is the cost of a True Negative result and so on.

In a simplified manner, you sum each outcome count with a certain weight.
You need to decide how much weight to give each result type, i.e. the value of a TP, TN, FP & FN.
This may vary per customer need. For example, some customers would like your algorithm to provide more results at the expense of quality, so you would put a smaller penalty/weight on FP/FN results. Others are very sensitive to faulty results, so your cost function would give a higher weight to FP.
Note: I would recommend setting the cost values of TP and TN such that if all the results are correct, the expected sum is a round number. For example, if we have 10 tests, 5 positive and 5 negative, we would give each TP & TN a cost value of 10, so the overall score of a fully successful run would be 100.
Save the configuration permutation and its score to a file.
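A minimal Java sketch of such a cost function, using the weights from the example above (10 points per correct result); the error penalties here are illustrative assumptions, to be tuned to your customer's needs:

public class CostFunction {

    // Weights per outcome - the error penalties below are illustrative assumptions.
    static final double COST_TP = 10.0;
    static final double COST_TN = 10.0;
    static final double COST_FP = -15.0;  // false alarms hurt more in this example
    static final double COST_FN = -5.0;

    static double score(int tp, int tn, int fp, int fn) {
        return COST_TP * tp + COST_TN * tn + COST_FP * fp + COST_FN * fn;
    }

    public static void main(String[] args) {
        // A fully successful run on the 10-sample example scores 100.
        System.out.println(score(5, 5, 0, 0));   // 100.0
        // One false alarm and one miss lower the score: 80 - 15 - 5 = 60.
        System.out.println(score(4, 4, 1, 1));   // 60.0
    }
}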


 Analysis

Put all the run results in Excel.
It is very easy to find the maximum cost-function score and its configuration set. However, to find the most robust configuration, look for the configuration value that has the best average across the other permutations. To do this, create a pivot table with all the configuration values in the "Rows" section, then drag the cost-function score field into the "Values" section. Duplicate the score field in the "Values" section and set one copy to the Average function and the other to the Max function.
Since the whole configuration set is in the Rows section of the pivot table, it is possible to collapse it so only the first value is shown.
The final step is to write down the best value per average score. Sometimes the configuration value with the best average score will match the one with the best maximum score; sometimes it won't.

The images below demonstrate a certain configuration value and its average and maximum scores. You can see that the average and the maximum score do not always correlate.
In the image below, a certain algorithm configuration parameter has the values 40, 50, 60 & 70.
The value 70 appears in the configuration permutation of the maximum score and also has the best average. Having the best average means that setting the value to 70 gives the best average score across all other configuration permutations, which implies that this configuration value is robust.


In the next image we chose another algorithm configuration parameter, with the values 0, 1 & 2.
We did this by dragging the relevant configuration parameter field to the first position in the pivot table "Rows" section.
You can see that the best average cost-function score is for the value 0, while the best maximum cost-function score is for the value 1.


Determine the winner with a final contest

At the end of the analysis phase, you will have one configuration set that provides the maximum score and another that provides the best average score.

This phase tries to minimize any influence the test data itself might have on the outcome, i.e. we want to rule out the possibility that the chosen configuration is biased toward a particular test data set.

Create test data sets with the following mixes:
  • 80% positive tests
  • 70% positive tests
  • 30% positive tests
  • 20% positive tests
Run each data set against both the best-maximum and the best-average configuration sets (a sketch for building such ratio-based data sets appears below).
The winner is determined by your customer's needs and the results. If there isn't much variation between the run scores, your algorithm is robust and you can choose the configuration that gave the maximum score. However, if there is a big variation, your algorithm is data dependent, and it is up to your customer's needs whether to choose the best average or the best maximum.
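Below is a small Java sketch of how such ratio-based data sets could be assembled from a labeled pool; the sample pools are illustrative only, and in practice they would come from your labeled CSV data.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class RatioDataSetBuilder {

    // Builds a data set of the requested size with the requested share of positive samples.
    static List<String> build(List<String> positives, List<String> negatives,
                              int size, double positiveRatio) {
        Collections.shuffle(positives);
        Collections.shuffle(negatives);
        int positiveCount = (int) Math.round(size * positiveRatio);
        List<String> dataSet = new ArrayList<>(positives.subList(0, positiveCount));
        dataSet.addAll(negatives.subList(0, size - positiveCount));
        Collections.shuffle(dataSet);
        return dataSet;
    }

    public static void main(String[] args) {
        // Illustrative pools - in practice these come from your labeled CSV data.
        List<String> cars = new ArrayList<>(Arrays.asList("Toyota", "Nissan", "Ford", "Honda",
                "Mazda", "Kia", "Fiat", "Audi", "BMW", "Skoda"));
        List<String> notCars = new ArrayList<>(Arrays.asList("mouse", "keyboard", "button", "screen",
                "desk", "chair", "lamp", "cable", "phone", "cup"));

        for (double ratio : new double[]{0.8, 0.7, 0.3, 0.2}) {
            System.out.println((int) (ratio * 100) + "% positive: " + build(cars, notCars, 10, ratio));
        }
    }
}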





Extra bonuses you get from using this method:

  • Testing how your algorithm utilizes resources over time (memory leaks, CPU utilization, disk space, ...)
  • The ability to accurately measure changes in the algorithm over time, across versions.


Any feedback is highly welcome.


Further good reads:
http://www.chioka.in/class-imbalance-problem/
https://en.wikipedia.org/wiki/Sensitivity_and_specificity

Monday, February 29, 2016

2 Android Studio productivity tips you MUST know

This post covers two frequent tasks one usually does in Android Studio:

  • Modifying layout XML files
  • Observing logcat output
If you perform these actions, make sure you are aware of the tips below.


Modifying layout XML files
When modifying XML layout files, one has to decide the ultimate decision: 
Which view to use - Design or Text view?


The Design view gives you a good visual presentation of the layout, while the Text view gives faster property control and easier editing. So eventually, one finds oneself switching back and forth between those tabs.

Fortunately, you can have them both and gain productivity!

Choose the Text tab and you will see a collapsed "Preview" label on the right side of the IDE (see image below).


Clicking it opens a small preview pane on the right. Every change made in the Text tab is reflected immediately in the preview, and every element you select in the preview highlights the relevant element in the Text view.

This is how the dual view might look:

Android Studio Text and Visual Preview on the same view



Observing logcat output
Logging to logcat is one of the fundamentals every Android developer uses.
It is very useful for tracing application output and execution.
Often one wants to compare the log output of two sessions in order to find differences.
Android Studio has a built-in tool for that, which makes the comparison super-fast.

The first step is to select the first output you want to compare and copy it to the clipboard (i.e. Ctrl+C).
The image below shows a sample selection & copy from logcat:


The second step is to filter logcat to show the latest run (you don't have to use a filter, just make sure the output is visible in logcat), select it and right-click the selection.
In the context menu you will see an option to compare the selected text with the clipboard.

Right Click context menu on selection

Choose this option and you will see a diff tool.  

Android Studio Diff tool



That's it. Hope you will find it useful.

Please include this link to the original post.

Wednesday, February 17, 2016

2 ways gradle overpowers maven in Android builds

Since Android Studio, the official IDE for Android-based projects, uses gradle as its build platform, many maven users have started building their projects with gradle.

As I constantly learn new features about gradle, I often reflect on the differences between  these two build systems.

The following observations are mainly focused on my experience in Android development.

3rd party library integration -
Assuming you use Maven in an Eclipse-based project, the process is tedious: adding the entire library source as a project, marking it as a library dependency & manually modifying the permissions in the AndroidManifest.xml files.
In gradle, you simply declare the dependency. The aar format lets gradle merge everything automatically into the final apk.

Version control -
In maven, you must either specify an exact version for a given dependency or use the word SNAPSHOT to take a version that is still subject to updates.
i.e. if you define a dependency on spring-core:


<dependency>
    <groupId>org.springframework</groupId>
    <artifactId>spring-core</artifactId>
    <version>4.2.4.RELEASE</version>
</dependency>

or in its development version:

<dependency>
    <groupId>org.springframework</groupId>
    <artifactId>spring-core</artifactId>
    <version>4.2.4.SNAPSHOT</version>
</dependency>


Either way, you will always get the 4.2.4 version.

In gradle you can specify a dependency version with a + sign, indicating that you want to receive any newer version (this concept is called a dynamic version), e.g. org.springframework:spring-core:+.
So if the artifact is released with a new version (a higher number), you get it automatically.
The mechanism has even finer granularity: you can specify 3.+ and get only updates related to version 3.

One reason maven has real issues determining which version is the latest is that it allows alphanumeric characters in the version (e.g. alpha, beta, ...).
In gradle, you are encouraged to use numbers only, so it can automatically recognize which version is the latest.



Please include this link to the original post.

Friday, February 12, 2016

10 tips on how to build the perfect SDK

This post was born from a question by a friend of mine, who thought there was not enough documentation on how to write a good SDK that others can easily use.

In the last decade, SDK usage has become a major part of the development lifecycle. In fact, SDKs are so commonly used and integrated into products that one could say developers need broad knowledge of many frameworks more than the ability to implement deep algorithms themselves.

This post mainly addresses those who want to learn how to write the best SDK and supply documentation for it.

An SDK should have a single goal/orientation, and its documentation should be focused and clear.

If you feel there is more than one focus area - consider splitting it.


Below is a list which I hope will help you construct an SDK in a good way and write its documentation:


0.     Learn what is out there

Try to see what your competitors, or companies in a domain similar to yours, have done.

This may give you a point of reference. Take what you like and improve what you did not like.

1.     Simplicity

Code - simple code means your consumers find it easy to use. This might mean offering as few ways as possible to interact with your code, e.g. exposing only one interface class; short method signatures, e.g. a small number of input parameters; etc.

Except for initiation, which occurs once and might require some configuration, make the usage of your SDK methods as simple as possible.

As such, try to keep method signatures with as few parameters as possible.

You can achieve this by providing default configuration and default implementation classes, that can be overridden by advanced users.

Hide any class or method consumers don't need to use, i.e. make classes/methods public only if consumers must use them; otherwise use package-local or private scope. Some IDEs can help you do this automatically via code inspection and cleanup.

Documentation: Make your documentation as simple as possible. Sometimes this means writing more explanatory text and sometimes writing as little as possible. Inline code samples often help, as most humans learn by example.

2.     Provide an easy start - a way for someone to use your code in less than 5 minutes. This is important because consumers want as little integration effort as possible; moreover, consumers often want to evaluate your product first, and without the ability to easily experiment with it, they will probably skip it.

3.     Keep it short

This section is mainly relevant to documentation, but it also relates to the ways consumers interact with the SDK code. Regarding documentation, this can be achieved by providing code samples, self-explanatory method names and defaults.

4.     Integration - keep in mind the diversity of your consumers' development environments.

For example, if you are writing an Android library, the integration varies depending on whether your consumers use Android Studio with gradle (which requires aar artifacts and publishing to a remote repository) or Eclipse, where you need to provide jar files, instructions on how to change the AndroidManifest.xml and a standalone Eclipse project for the SDK.

This will impact your build mechanism & its artifacts. However, don't try to win it all from day one. Do what fits best for your first client or for most of your predicted consumers.

5.     Sample project

Create the most basic project in GitHub that simulates a client that uses your SDK.

This demonstrates to your consumers how your product can fit their needs, as well as the easiest way to integrate with it. If you want to show advanced usage, do it in a separate project. Your consumers will often use this project as their main source of documentation, so provide inline comments and write the code in as self-explanatory a way as possible.

6.     Overview - at the beginning of the documentation, or in the README.md file of the GitHub project, provide an overview of your solution in plain English. In this section, I usually like to provide a sample use case that explains typical SDK usage. If possible, add a simple diagram or chart, so that people who don't like reading manuals will see the benefit of your SDK quickly.

7.     Initiation - use conventions that are accepted in the SDK's domain.

It may be overloaded constructors, a builder pattern or similar. The initiation should use defaults smartly in order to preserve the easy start, as in the sketch below.
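For illustration, here is a hedged Java sketch of a builder-style initiation where everything except a mandatory API key falls back to defaults; AnalyticsSdk and its parameter names are made up for this example.

public class AnalyticsSdk {

    private final String apiKey;
    private final int flushIntervalSeconds;
    private final boolean verboseLogging;

    private AnalyticsSdk(Builder builder) {
        this.apiKey = builder.apiKey;
        this.flushIntervalSeconds = builder.flushIntervalSeconds;
        this.verboseLogging = builder.verboseLogging;
    }

    public static class Builder {
        private final String apiKey;            // the only mandatory parameter
        private int flushIntervalSeconds = 60;  // sensible default
        private boolean verboseLogging = false; // sensible default

        public Builder(String apiKey) { this.apiKey = apiKey; }

        public Builder flushIntervalSeconds(int seconds) {
            this.flushIntervalSeconds = seconds;
            return this;
        }

        public Builder verboseLogging(boolean enabled) {
            this.verboseLogging = enabled;
            return this;
        }

        public AnalyticsSdk build() {
            return new AnalyticsSdk(this);
        }
    }
}

A consumer can get started with new AnalyticsSdk.Builder("my-key").build() and only touch the other setters when the defaults don't fit.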

8.     Defaults 

Defaults are important for keeping the code simple and reducing configuration (see the simplicity section). The defaults you provide (either configuration or implementation) should represent the way you think most consumers will use your SDK.

You can provide several method overloads, where the simplest signature calls the more advanced signature with the defaults, as in the sketch below.
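A short sketch of what such delegation might look like; ImageLoader and its parameters are made-up names, and the real loading logic is omitted.

public class ImageLoader {

    private static final int DEFAULT_TIMEOUT_MS = 5000;
    private static final boolean DEFAULT_USE_CACHE = true;

    // The simple signature most consumers will call...
    public byte[] load(String url) {
        return load(url, DEFAULT_TIMEOUT_MS, DEFAULT_USE_CACHE);
    }

    // ...delegates to the advanced signature, filling in sensible defaults.
    public byte[] load(String url, int timeoutMs, boolean useCache) {
        // Real loading logic is omitted in this sketch; it would fetch the url with the given options.
        return new byte[0];
    }
}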

9. Publishing

  • Offline - use a non-editable format such as PDF. The advantage is that you can easily create one, store it locally in Dropbox, and with each update the version is refreshed automatically.
  • Online - your corporate website. This is the preferred way; however, updating it might involve some IT overhead.


Hope these guidelines help. Feedback is more than welcome.

Please keep this link to the original post.


Monday, February 1, 2016

Android Gradle - build tips - Lint HTML report


With the release of gradle 2.10 comes a nice feature: Lint check HTML reports.

Simply speaking, whenever you run a build, a nicely formatted HTML report is generated. It is a VERY nice addition to the lint XML reports and makes the output much more readable and hence more actionable. It will save you time and make your code more robust and stable.

This feature was contributed by Sebastian Schuberth .

To check it out do the following:

1. Download gradle 2.10 and extract it to the default Android Studio gradle directory.
It may be:
C:/Program Files/Android/Android Studio/gradle/gradle-2.10

2. Locate the file gradle-wrapper.properties via Android Studio and change distributionUrl to: https\://services.gradle.org/distributions/gradle-2.10-all.zip

3. In Android Studio, open File -> Settings and change the gradle location to the directory where you extracted gradle 2.10.
Android studio - change gradle version

4. Open a terminal in the project folder (either via the Android Studio terminal view or an external command line tool) and type: gradlew {module.name}:build
Note: The {module.name} parameter is optional and intended for multi-module projects where you only want to build certain modules.

That's it.

After the build finishes, scroll through its output and you will see something similar to:
Wrote HTML report to file:///C:/dev/workspace/common/commonlibrary/build/outputs/lint-results.html
Wrote XML report to file:///C:/dev/workspace/common/commonlibrary/build/outputs/lint-results.xml

Open the html link and you would be pleasantly surprised.

Check it out:

Lint Html report - sample

IMHO, the generated report is visually better than the one generated via Android Studio -> Code Inspection. It is very convenient for reading the warnings/errors and correcting them.

Another advantage is that the report is generated on every build, not only on demand.

Note:
The lint HTML report is generated by the command-line build, not by the default Android Studio build.




Please include a link to the original post.